linux

mirror of https://github.com/hardkernel/linux.git synced 2026-04-02 11:13:02 +09:00

Author	SHA1	Message	Date
secuflag	b8154d6f54	Revert "block: cgroups, kconfig, build bits for BFQ-v7r11-4.5.0" This reverts commit f6532c1fec5fa35626b08dfcc1d5705af96be3e7.	2023-05-16 08:39:45 +09:00
Paolo Valente	48f3ff96f2	block: cgroups, kconfig, build bits for BFQ-v7r11-4.5.0 Update Kconfig.iosched and do the related Makefile changes to include kernel configuration options for BFQ. Also increase the number of policies supported by the blkio controller so that BFQ can add its own. Signed-off-by: Paolo Valente <paolo.valente@unimore.it> Signed-off-by: Arianna Avanzini <avanzini@google.com> block: introduce the BFQ-v7r11 I/O sched for 4.5.0 The general structure is borrowed from CFQ, as much of the code for handling I/O contexts. Over time, several useful features have been ported from CFQ as well (details in the changelog in README.BFQ). A (bfq_)queue is associated to each task doing I/O on a device, and each time a scheduling decision has to be made a queue is selected and served until it expires. - Slices are given in the service domain: tasks are assigned budgets, measured in number of sectors. Once got the disk, a task must however consume its assigned budget within a configurable maximum time (by default, the maximum possible value of the budgets is automatically computed to comply with this timeout). This allows the desired latency vs "throughput boosting" tradeoff to be set. - Budgets are scheduled according to a variant of WF2Q+, implemented using an augmented rb-tree to take eligibility into account while preserving an O(log N) overall complexity. - A low-latency tunable is provided; if enabled, both interactive and soft real-time applications are guaranteed a very low latency. - Latency guarantees are preserved also in the presence of NCQ. - Also with flash-based devices, a high throughput is achieved while still preserving latency guarantees. - BFQ features Early Queue Merge (EQM), a sort of fusion of the cooperating-queue-merging and the preemption mechanisms present in CFQ. 
EQM is in fact a unified mechanism that tries to get a sequential read pattern, and hence a high throughput, with any set of processes performing interleaved I/O over a contiguous sequence of sectors. - BFQ supports full hierarchical scheduling, exporting a cgroups interface. Since each node has a full scheduler, each group can be assigned its own weight. - If the cgroups interface is not used, only I/O priorities can be assigned to processes, with ioprio values mapped to weights with the relation weight = IOPRIO_BE_NR - ioprio. - ioprio classes are served in strict priority order, i.e., lower priority queues are not served as long as there are higher priority queues. Among queues in the same class the bandwidth is distributed in proportion to the weight of each queue. A very thin extra bandwidth is however guaranteed to the Idle class, to prevent it from starving. Signed-off-by: Paolo Valente <paolo.valente@unimore.it> Signed-off-by: Arianna Avanzini <avanzini@google.com> block, bfq: add Early Queue Merge (EQM) to BFQ-v7r11 for 4.5.0 A set of processes may happen to perform interleaved reads, i.e., requests whose union would give rise to a sequential read pattern. There are two typical cases: in the first case, processes read fixed-size chunks of data at a fixed distance from each other, while in the second case processes may read variable-size chunks at variable distances. The latter case occurs for example with QEMU, which splits the I/O generated by the guest into multiple chunks, and lets these chunks be served by a pool of cooperating processes, iteratively assigning the next chunk of I/O to the first available process. CFQ uses actual queue merging for the first type of processes, whereas it uses preemption to get a sequential read pattern out of the read requests performed by the second type of processes. In the end it uses two different mechanisms to achieve the same goal: boosting the throughput with interleaved I/O. 
This patch introduces Early Queue Merge (EQM), a unified mechanism to get a sequential read pattern with both types of processes. The main idea is checking newly arrived requests against the next request of the active queue both in case of actual request insert and in case of request merge. By doing so, both the types of processes can be handled by just merging their queues. EQM is then simpler and more compact than the pair of mechanisms used in CFQ. Finally, EQM also preserves the typical low-latency properties of BFQ, by properly restoring the weight-raising state of a queue when it gets back to a non-merged state. Signed-off-by: Mauro Andreolini <mauro.andreolini@unimore.it> Signed-off-by: Arianna Avanzini <avanzini@google.com> Signed-off-by: Paolo Valente <paolo.valente@unimore.it> Signed-off-by: Linus Walleij <linus.walleij@linaro.org> Turn into BFQ-v8r7 for 4.9.0 CHANGELOG from v8r4 to v8r7 BFQ v8r7 BUGFIX: make BFQ compile also without hierarchical support BFQ v8r6 BUGFIX Removed the check that, when the new queue to set in service must be selected, the cached next_in_service entities coincide with the entities chosen by __bfq_lookup_next_entity. This check, issuing a warning on failure, was wrong, because the cached and the newly chosen entity could differ in case of a CLASS_IDLE timeout. EFFICIENCY IMPROVEMENT (this improvement is related to the above BUGFIX) The cached next_in_service entities are now really used to select the next queue to serve when the in-service queue expires. Before this change, the cached values were used only for extra (and in general wrong) consistency checks. This caused additional overhead instead of reducing it. EFFICIENCY IMPROVEMENT The next entity to serve, for each level of the hierarchy, is now updated on every event that may change it, i.e., on every activation or deactivation of any entity. 
This finer granularity is not strictly needed for correctness, because it is only on queue expirations that BFQ needs to know what are the next entities to serve. Yet this change makes it possible to implement optimizations in which it is necessary to know the next queue to serve before the in-service queue expires. SERVICE-ACCURACY IMPROVEMENT The per-device CLASS_IDLE service timeout has been turned into a much more accurate per-group timeout. CODE-QUALITY IMPROVEMENT The non-trivial parts touched by the above improvements have been partially rewritten, and enriched with comments, so as to improve their transparency and understandability. IMPROVEMENT Ported and improved CFQ commit `41647e7a` Before this improvement, BFQ used the same logic for detecting seeky queues for rotational disks and SSDs. This logic is appropriate for the former, as it takes into account only inter-request distance, and the latter is the dominant latency factor on a rotational device. Yet things change with flash-based devices, where serving a large request still yields a high throughput, even if the request is far from the previous request served. This commit extends seeky detection to take into account also this fact with flash-based devices. In particular, this commit is an improved port of the original commit `41647e7a` for CFQ. CODE IMPROVEMENT Remove useless parameter from bfq_del_bfqq_busy OPTIMIZATION Optimize the update of next_in_service entity. If the update of the next_in_service candidate entity is triggered by the activation of an entity, then it is not necessary to perform full lookups in the active trees to update next_in_service. In fact, it is enough to check whether the just-activated entity has a higher priority than next_in_service, or, even if it has the same priority as next_in_service, is eligible and has a lower virtual finish time than next_in_service. If this compound condition holds, then the new entity can be set as the new next_in_service. 
Otherwise no change is needed. This commit implements this optimization. BUGFIX Fix bug causing occasional loss of weight raising. When a bfq_queue, say bfqq, is split after a merging with another bfq_queue, BFQ checks whether it has to restore for bfqq the weight-raising state that bfqq had before being merged. In particular, the weight-raising is restored only if, according to the weight-raising duration decided for bfqq when it started to be weight-raised (before being merged), bfqq would not have already finished its weight-raising period. Yet, by mistake, such a duration was not saved when bfqq is merged. So, if bfqq was freed and reallocated when it was split, then this duration was wrongly set to zero on the split. As a consequence, the weight-raising state of bfqq was wrongly not restored, which caused BFQ to fail in guaranteeing a low latency to bfqq. This commit fixes this bug by saving weight-raising duration when bfqq is merged, and correctly restoring it when bfqq is split. BUGFIX Fix wrong reset of in-service entities In-service entities were reset with an indirect logic, which happened to be even buggy for some cases. This commit fixes this bug in two important steps. First, by replacing this indirect logic with a direct logic, in which all involved entities are immediately reset, with a bubble-up loop, when the in-service queue is reset. Second, by restructuring the code related to this change, so as to become not only correct with respect to this change, but also cleaner and hopefully clearer. CODE IMPROVEMENT Add code to be able to redirect trace log to console. BUGFIX Fixed bug in optimized update of next_in_service entity. There was a case where bfq_update_next_in_service did not update next_in_service, even if it might need to be changed: in case of requeueing or repositioning of the entity that happened to be pointed exactly by next_in_service. 
This could result in violation of service guarantees, because, after a change of timestamps for such an entity, it might be the case that next_in_service had to point to a different entity. This commit fixes this bug. OPTIMIZATION Stop bubble-up of next_in_service update if possible. BUGFIX Fixed a false-positive warning for uninitialized var BFQ-v8r5 DOCUMENTATION IMPROVEMENT Added documentation of BFQ benefits, inner workings, interface and tunables. BUGFIX: Replaced max wrongly used for modulo numbers. DOCUMENTATION IMPROVEMENT Improved help message in Kconfig.iosched. BUGFIX: Removed wrong conversion in use of bfq_fifo_expire. CODE IMPROVEMENT Added parentheses to complex macros. Signed-off-by: Paolo Valente <paolo.valente@linaro.org>	2023-05-16 08:39:14 +09:00
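The ioprio-to-weight relation and the proportional bandwidth sharing described in the BFQ commit message above can be illustrated with a small sketch. This is not BFQ code: it assumes `IOPRIO_BE_NR = 8`, and `ioprio_to_weight` and `bandwidth_shares` are hypothetical helper names used only for this example.

```python
from fractions import Fraction

# Assumption: eight best-effort ioprio levels (0 = highest, 7 = lowest).
IOPRIO_BE_NR = 8


def ioprio_to_weight(ioprio):
    """Map an ioprio value to a weight: weight = IOPRIO_BE_NR - ioprio."""
    if not 0 <= ioprio < IOPRIO_BE_NR:
        raise ValueError("ioprio out of range")
    return IOPRIO_BE_NR - ioprio


def bandwidth_shares(ioprios):
    """Split bandwidth among same-class queues in proportion to their weights."""
    weights = [ioprio_to_weight(p) for p in ioprios]
    total = sum(weights)
    return [Fraction(w, total) for w in weights]


# Three queues in the same class with ioprio 0, 4 and 7
# get weights 8, 4 and 1, so their shares are 8/13, 4/13 and 1/13.
shares = bandwidth_shares([0, 4, 7])
```

The sketch covers only sharing within one ioprio class; the strict ordering between classes and the small reserve for the Idle class are left out.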
Yuchung Cheng	5271b31147	tcp: allow at most one TLP probe per flight [ Upstream commit `76be93fc07` ] Previously TLP may send multiple probes of new data in one flight. This happens when the sender is cwnd limited. After the initial TLP containing new data is sent, the sender receives another ACK that acks partial inflight. It may re-arm another TLP timer to send more, if no further ACK returns before the next TLP timeout (PTO) expires. The sender may send in theory a large amount of TLP until send queue is depleted. This only happens if the sender sees such irregular uncommon ACK pattern. But it is generally undesirable behavior during congestion especially. The original TLP design restricts the sender to only one TLP probe per inflight as published in "Reducing Web Latency: the Virtue of Gentle Aggression", SIGCOMM 2013. This patch changes TLP to send at most one probe per inflight. Note that if the sender is app-limited, TLP retransmits old data and does not have this issue. Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2023-05-16 08:39:02 +09:00
Michael J. Ruhl	d4ca9b4931	io-mapping: indicate mapping failure commit `e0b3e0b1a0` upstream. The !ATOMIC_IOMAP version of io_mapping_init_wc will always return success, even when the ioremap fails. Since the ATOMIC_IOMAP version returns NULL when the init fails, and callers check for a NULL return on error this is unexpected. During a device probe, where the ioremap failed, a crash can look like this: BUG: unable to handle page fault for address: 0000000000210000 #PF: supervisor write access in kernel mode #PF: error_code(0x0002) - not-present page Oops: 0002 [#1] PREEMPT SMP CPU: 0 PID: 177 Comm: RIP: 0010:fill_page_dma [i915] gen8_ppgtt_create [i915] i915_ppgtt_create [i915] intel_gt_init [i915] i915_gem_init [i915] i915_driver_probe [i915] pci_device_probe really_probe driver_probe_device The remap failure occurred much earlier in the probe. If it had been propagated, the driver would have exited with an error. Return NULL on ioremap failure. [akpm@linux-foundation.org: detect ioremap_wc() errors earlier] Fixes: `cafaf14a5d` ("io-mapping: Always create a struct to hold metadata about the io-mapping") Signed-off-by: Michael J. Ruhl <michael.j.ruhl@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Chris Wilson <chris@chris-wilson.co.uk> Cc: Daniel Vetter <daniel@ffwll.ch> Cc: <stable@vger.kernel.org> Link: http://lkml.kernel.org/r/20200721171936.81563-1-michael.j.ruhl@intel.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2023-05-16 08:38:47 +09:00
Takashi Iwai	801c317e20	usb: core: Add a helper function to check the validity of EP type in URB commit `e901b98738` upstream. This patch adds a new helper function to perform a sanity check of the given URB to see whether it contains a valid endpoint. It's a light- weight version of what usb_submit_urb() does, but without the kernel warning followed by the stack trace, just returns an error code. Especially for a driver that doesn't parse the descriptor but fills the URB with the fixed endpoint (e.g. some quirks for non-compliant devices), this kind of check is preferable at the probe phase before actually submitting the urb. Tested-by: Andrey Konovalov <andreyknvl@google.com> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Takashi Iwai <tiwai@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2023-05-16 08:30:42 +09:00
Cong Wang	89a4f176a2	cgroup: Fix sock_cgroup_data on big-endian. [ Upstream commit `14b032b8f8` ] In order for no_refcnt and is_data to be the lowest order two bits in the 'val' we have to pad out the bitfield of the u8. Fixes: `ad0f75e5f5` ("cgroup: fix cgroup_sk_alloc() for sk_clone_lock()") Reported-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2023-05-16 08:30:16 +09:00
Cong Wang	fca002646a	cgroup: fix cgroup_sk_alloc() for sk_clone_lock() [ Upstream commit `ad0f75e5f5` ] When we clone a socket in sk_clone_lock(), its sk_cgrp_data is copied, so the cgroup refcnt must be taken too. And, unlike the sk_alloc() path, sock_update_netprioidx() is not called here. Therefore, it is safe and necessary to grab the cgroup refcnt even when cgroup_sk_alloc is disabled. sk_clone_lock() is in BH context anyway, the in_interrupt() would terminate this function if called there. And for sk_alloc() skcd->val is always zero. So it's safe to factor out the code to make it more readable. The global variable 'cgroup_sk_alloc_disabled' is used to determine whether to take these reference counts. It is impossible to make the reference counting correct unless we save this bit of information in skcd->val. So, add a new bit there to record whether the socket has already taken the reference counts. This obviously relies on kmalloc() to align cgroup pointers to at least 4 bytes, ARCH_KMALLOC_MINALIGN is certainly larger than that. This bug seems to be introduced since the beginning, commit `d979a39d72` ("cgroup: duplicate cgroup reference when cloning sockets") tried to fix it but not completely. It seems not easy to trigger until the recent commit `090e28b229` ("netprio_cgroup: Fix unlimited memory leak of v2 cgroups") was merged. 
Fixes: `bd1060a1d6` ("sock, cgroup: add sock->sk_cgroup") Reported-by: Cameron Berkenpas <cam@neo-zeon.de> Reported-by: Peter Geis <pgwipeout@gmail.com> Reported-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Reported-by: Daniël Sonck <dsonck92@gmail.com> Reported-by: Zhang Qiang <qiang.zhang@windriver.com> Tested-by: Cameron Berkenpas <cam@neo-zeon.de> Tested-by: Peter Geis <pgwipeout@gmail.com> Tested-by: Thomas Lamprecht <t.lamprecht@proxmox.com> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Zefan Li <lizefan@huawei.com> Cc: Tejun Heo <tj@kernel.org> Cc: Roman Gushchin <guro@fb.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2023-05-16 08:30:14 +09:00
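The reliance on 4-byte pointer alignment mentioned in the commit message above is an instance of low-bit pointer tagging: if an address is always a multiple of 4, its two lowest bits are zero and can carry flags. The following is a minimal sketch of that idea only. It does not reproduce the kernel's `sock_cgroup_data` layout, and the bit assignments and the names `pack` and `unpack` are hypothetical.

```python
# Hypothetical bit assignments, chosen only for illustration.
FLAG_IS_DATA = 0x1
FLAG_NO_REFCNT = 0x2
FLAG_MASK = 0x3  # the two low bits that 4-byte alignment leaves free


def pack(addr, flags):
    """Store flags in the low two bits of a 4-byte-aligned address."""
    if addr & FLAG_MASK:
        raise ValueError("address must be 4-byte aligned")
    if flags & ~FLAG_MASK:
        raise ValueError("only the two low bits may be used as flags")
    return addr | flags


def unpack(val):
    """Recover (address, flags) from a packed value."""
    return val & ~FLAG_MASK, val & FLAG_MASK


packed = pack(0x1000, FLAG_NO_REFCNT)
addr, flags = unpack(packed)
```

An unaligned address would collide with the flag bits, which is why the sketch rejects it instead of silently corrupting the pointer.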
Shile Zhang	6cd73be5b6	sched/rt: Show the 'sched_rr_timeslice' SCHED_RR timeslice tuning knob in milliseconds [ Upstream commit `975e155ed8` ] We added the 'sched_rr_timeslice_ms' SCHED_RR tuning knob in this commit: `ce0dbbbb30` ("sched/rt: Add a tuning knob to allow changing SCHED_RR timeslice") ... which name suggests to users that it's in milliseconds, while in reality it's being set in milliseconds but the result is shown in jiffies. This is obviously confusing when HZ is not 1000, it makes it appear like the value set failed, such as HZ=100: root# echo 100 > /proc/sys/kernel/sched_rr_timeslice_ms root# cat /proc/sys/kernel/sched_rr_timeslice_ms 10 Fix this to be milliseconds all around. Signed-off-by: Shile Zhang <shile.zhang@nokia.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1485612049-20923-1-git-send-email-shile.zhang@nokia.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2023-05-16 08:29:07 +09:00
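The HZ=100 example in the commit message above comes down to a unit mismatch: the value is written in milliseconds, stored in jiffies, and (before the fix) read back in jiffies. A small sketch of the arithmetic, assuming a round-up conversion from milliseconds to jiffies; the function names mirror kernel helpers but these are simplified stand-ins, not the kernel implementations:

```python
def msecs_to_jiffies(ms, hz):
    """Convert milliseconds to scheduler ticks, rounding up (simplified)."""
    return (ms * hz + 999) // 1000


def jiffies_to_msecs(jiffies, hz):
    """Convert scheduler ticks back to milliseconds (simplified)."""
    return jiffies * 1000 // hz


HZ = 100
stored = msecs_to_jiffies(100, HZ)      # what the kernel keeps internally
shown_before_fix = stored               # raw jiffies were displayed: 10
shown_after_fix = jiffies_to_msecs(stored, HZ)  # milliseconds again: 100
```

With HZ=1000 one jiffy equals one millisecond, so the two readings coincide, which is why the mismatch only shows up for other HZ values.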
Alexander Lobakin	602eea333d	net: qed: fix left elements count calculation [ Upstream commit `97dd1abd02` ] qed_chain_get_element_left{,_u32} returned 0 when the difference between producer and consumer page count was equal to the total page count. Fix this by conditional expanding of producer value (vs unconditional). This allowed to eliminate normalization against total page count, which was the cause of this bug. Misc: replace open-coded constants with common defines. Fixes: `a91eb52abb` ("qed: Revisit chain implementation") Signed-off-by: Alexander Lobakin <alobakin@marvell.com> Signed-off-by: Igor Russkikh <irusskikh@marvell.com> Signed-off-by: Michal Kalderon <michal.kalderon@marvell.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>	2023-05-16 08:21:35 +09:00
Taehee Yoo	c2e7162a2f	net: core: reduce recursion limit value [ Upstream commit `fb7861d14c` ] In the current code, ->ndo_start_xmit() can be executed recursively only 10 times because of stack memory. But, in the case of the vxlan, 10 recursion limit value results in a stack overflow. In the current code, the nested interface is limited by 8 depth. There is no critical reason that the recursion limitation value should be 10. So, it would be good to be the same value with the limitation value of nesting interface depth. Test commands: ip link add vxlan10 type vxlan vni 10 dstport 4789 srcport 4789 4789 ip link set vxlan10 up ip a a 192.168.10.1/24 dev vxlan10 ip n a 192.168.10.2 dev vxlan10 lladdr fc:22:33:44:55:66 nud permanent for i in {9..0} do let A=$i+1 ip link add vxlan$i type vxlan vni $i dstport 4789 srcport 4789 4789 ip link set vxlan$i up ip a a 192.168.$i.1/24 dev vxlan$i ip n a 192.168.$i.2 dev vxlan$i lladdr fc:22:33:44:55:66 nud permanent bridge fdb add fc:22:33:44:55:66 dev vxlan$A dst 192.168.$i.2 self done hping3 192.168.10.2 -2 -d 60000 Splat looks like: [ 103.814237][ T1127] ============================================================================= [ 103.871955][ T1127] BUG kmalloc-2k (Tainted: G B ): Padding overwritten. 0x00000000897a2e4f-0x000 [ 103.873187][ T1127] ----------------------------------------------------------------------------- [ 103.873187][ T1127] [ 103.874252][ T1127] INFO: Slab 0x000000005cccc724 objects=5 used=5 fp=0x0000000000000000 flags=0x10000000001020 [ 103.881323][ T1127] CPU: 3 PID: 1127 Comm: hping3 Tainted: G B 5.7.0+ #575 [ 103.882131][ T1127] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006 [ 103.883006][ T1127] Call Trace: [ 103.883324][ T1127] dump_stack+0x96/0xdb [ 103.883716][ T1127] slab_err+0xad/0xd0 [ 103.884106][ T1127] ? _raw_spin_unlock+0x1f/0x30 [ 103.884620][ T1127] ? 
get_partial_node.isra.78+0x140/0x360 [ 103.885214][ T1127] slab_pad_check.part.53+0xf7/0x160 [ 103.885769][ T1127] ? pskb_expand_head+0x110/0xe10 [ 103.886316][ T1127] check_slab+0x97/0xb0 [ 103.886763][ T1127] alloc_debug_processing+0x84/0x1a0 [ 103.887308][ T1127] ___slab_alloc+0x5a5/0x630 [ 103.887765][ T1127] ? pskb_expand_head+0x110/0xe10 [ 103.888265][ T1127] ? lock_downgrade+0x730/0x730 [ 103.888762][ T1127] ? pskb_expand_head+0x110/0xe10 [ 103.889244][ T1127] ? __slab_alloc+0x3e/0x80 [ 103.889675][ T1127] __slab_alloc+0x3e/0x80 [ 103.890108][ T1127] __kmalloc_node_track_caller+0xc7/0x420 [ ... ] Fixes: `11a766ce91` ("net: Increase xmit RECURSION_LIMIT to 10.") Signed-off-by: Taehee Yoo <ap420073@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2023-05-16 08:21:04 +09:00
Jiri Olsa	020742f920	kretprobe: Prevent triggering kretprobe from within kprobe_flush_task [ Upstream commit `9b38cc704e` ] Ziqian reported lockup when adding retprobe on _raw_spin_lock_irqsave. My test was also able to trigger lockdep output: ============================================ WARNING: possible recursive locking detected 5.6.0-rc6+ #6 Not tainted -------------------------------------------- sched-messaging/2767 is trying to acquire lock: ffffffff9a492798 (&(kretprobe_table_locks[i].lock)){-.-.}, at: kretprobe_hash_lock+0x52/0xa0 but task is already holding lock: ffffffff9a491a18 (&(kretprobe_table_locks[i].lock)){-.-.}, at: kretprobe_trampoline+0x0/0x50 other info that might help us debug this: Possible unsafe locking scenario: CPU0 ---- lock(&(kretprobe_table_locks[i].lock)); lock(&(kretprobe_table_locks[i].lock)); * DEADLOCK * May be due to missing lock nesting notation 1 lock held by sched-messaging/2767: #0: ffffffff9a491a18 (&(kretprobe_table_locks[i].lock)){-.-.}, at: kretprobe_trampoline+0x0/0x50 stack backtrace: CPU: 3 PID: 2767 Comm: sched-messaging Not tainted 5.6.0-rc6+ #6 Call Trace: dump_stack+0x96/0xe0 __lock_acquire.cold.57+0x173/0x2b7 ? native_queued_spin_lock_slowpath+0x42b/0x9e0 ? lockdep_hardirqs_on+0x590/0x590 ? __lock_acquire+0xf63/0x4030 lock_acquire+0x15a/0x3d0 ? kretprobe_hash_lock+0x52/0xa0 _raw_spin_lock_irqsave+0x36/0x70 ? kretprobe_hash_lock+0x52/0xa0 kretprobe_hash_lock+0x52/0xa0 trampoline_handler+0xf8/0x940 ? kprobe_fault_handler+0x380/0x380 ? find_held_lock+0x3a/0x1c0 kretprobe_trampoline+0x25/0x50 ? lock_acquired+0x392/0xbc0 ? _raw_spin_lock_irqsave+0x50/0x70 ? __get_valid_kprobe+0x1f0/0x1f0 ? _raw_spin_unlock_irqrestore+0x3b/0x40 ? finish_task_switch+0x4b9/0x6d0 ? __switch_to_asm+0x34/0x70 ? __switch_to_asm+0x40/0x70 The code within the kretprobe handler checks for probe reentrancy, so we won't trigger any _raw_spin_lock_irqsave probe in there. 
The problem is in outside kprobe_flush_task, where we call: kprobe_flush_task kretprobe_table_lock raw_spin_lock_irqsave _raw_spin_lock_irqsave where _raw_spin_lock_irqsave triggers the kretprobe and installs kretprobe_trampoline handler on _raw_spin_lock_irqsave return. The kretprobe_trampoline handler is then executed with already locked kretprobe_table_locks, and first thing it does is to lock kretprobe_table_locks ;-) the whole lockup path like: kprobe_flush_task kretprobe_table_lock raw_spin_lock_irqsave _raw_spin_lock_irqsave ---> probe triggered, kretprobe_trampoline installed ---> kretprobe_table_locks locked kretprobe_trampoline trampoline_handler kretprobe_hash_lock(current, &head, &flags); <--- deadlock Adding kprobe_busy_begin/end helpers that mark code with fake probe installed to prevent triggering of another kprobe within this code. Using these helpers in kprobe_flush_task, so the probe recursion protection check is hit and the probe is never set to prevent above lockup. Link: http://lkml.kernel.org/r/158927059835.27680.7011202830041561604.stgit@devnote2 Fixes: `ef53d9c5e4` ("kprobes: improve kretprobe scalability with hashed locking") Cc: Ingo Molnar <mingo@kernel.org> Cc: "Gustavo A . R . Silva" <gustavoars@kernel.org> Cc: Anders Roxell <anders.roxell@linaro.org> Cc: "Naveen N . Rao" <naveen.n.rao@linux.ibm.com> Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com> Cc: David Miller <davem@davemloft.net> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <peterz@infradead.org> Cc: stable@vger.kernel.org Reported-by: "Ziqian SUN (Zamir)" <zsun@redhat.com> Acked-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2023-05-16 08:20:37 +09:00
Ahmed S. Darwish	858e153493	block: nr_sects_write(): Disable preemption on seqcount write [ Upstream commit `15b81ce5ab` ] For optimized block readers not holding a mutex, the "number of sectors" 64-bit value is protected from tearing on 32-bit architectures by a sequence counter. Disable preemption before entering that sequence counter's write side critical section. Otherwise, the read side can preempt the write side section and spin for the entire scheduler tick. If the reader belongs to a real-time scheduling class, it can spin forever and the kernel will livelock. Fixes: `c83f6bf98d` ("block: add partition resize function to blkpg ioctl") Cc: <stable@vger.kernel.org> Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de> Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>	2023-05-16 08:20:27 +09:00
Kai-Heng Feng	7a6de8a9dc	libata: Use per port sync for detach [ Upstream commit `b5292111de` ] Commit `130f4caf14` ("libata: Ensure ata_port probe has completed before detach") may cause system freeze during suspend. Using async_synchronize_full() in PM callbacks is wrong, since async callbacks that are already scheduled may wait for not-yet-scheduled callbacks, causes a circular dependency. Instead of using big hammer like async_synchronize_full(), use async cookie to make sure port probe are synced, without affecting other scheduled PM callbacks. Fixes: `130f4caf14` ("libata: Ensure ata_port probe has completed before detach") Suggested-by: John Garry <john.garry@huawei.com> Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com> Tested-by: John Garry <john.garry@huawei.com> BugLink: https://bugs.launchpad.net/bugs/1867983 Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>	2023-05-16 08:20:16 +09:00
Nick Desaulniers	e31535e9ca	elfnote: mark all .note sections SHF_ALLOC [ Upstream commit `51da9dfb7f` ] ELFNOTE_START allows callers to specify flags for .pushsection assembler directives. All callsites but ELF_NOTE use "a" for SHF_ALLOC. For vdso's that explicitly use ELF_NOTE_START and BUILD_SALT, the same section is specified twice after preprocessing, once with "a" flag, once without. Example: .pushsection .note.Linux, "a", @note ; .pushsection .note.Linux, "", @note ; While GNU as allows this ordering, it warns for the opposite ordering, making these directives position dependent. We'd prefer not to precisely match this behavior in Clang's integrated assembler. Instead, the non __ASSEMBLY__ definition of ELF_NOTE uses __attribute__((section(".note.Linux"))) which is created with SHF_ALLOC, so let's make the __ASSEMBLY__ definition of ELF_NOTE consistent with C and just always use "a" flag. This allows Clang to assemble a working mainline (5.6) kernel via: $ make CC=clang AS=clang Signed-off-by: Nick Desaulniers <ndesaulniers@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Nathan Chancellor <natechancellor@gmail.com> Reviewed-by: Fangrui Song <maskray@google.com> Cc: Jeremy Fitzhardinge <jeremy@goop.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Link: https://github.com/ClangBuiltLinux/linux/issues/913 Link: http://lkml.kernel.org/r/20200325231250.99205-1-ndesaulniers@google.com Debugged-by: Ilie Halip <ilie.halip@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2023-05-16 08:20:01 +09:00
Arnd Bergmann	fa53e3a7b9	include/linux/bitops.h: avoid clang shift-count-overflow warnings [ Upstream commit `bd93f003b7` ] Clang normally does not warn about certain issues in inline functions when it only happens in an eliminated code path. However if something else goes wrong, it does tend to complain about the definition of hweight_long() on 32-bit targets: include/linux/bitops.h:75:41: error: shift count >= width of type [-Werror,-Wshift-count-overflow] return sizeof(w) == 4 ? hweight32(w) : hweight64(w); ^~~~~~~~~~~~ include/asm-generic/bitops/const_hweight.h:29:49: note: expanded from macro 'hweight64' define hweight64(w) (__builtin_constant_p(w) ? __const_hweight64(w) : __arch_hweight64(w)) ^~~~~~~~~~~~~~~~~~~~ include/asm-generic/bitops/const_hweight.h:21:76: note: expanded from macro '__const_hweight64' define __const_hweight64(w) (__const_hweight32(w) + __const_hweight32((w) >> 32)) ^ ~~ include/asm-generic/bitops/const_hweight.h:20:49: note: expanded from macro '__const_hweight32' define __const_hweight32(w) (__const_hweight16(w) + __const_hweight16((w) >> 16)) ^ include/asm-generic/bitops/const_hweight.h:19:72: note: expanded from macro '__const_hweight16' define __const_hweight16(w) (__const_hweight8(w) + __const_hweight8((w) >> 8 )) ^ include/asm-generic/bitops/const_hweight.h:12:9: note: expanded from macro '__const_hweight8' (!!((w) & (1ULL << 2))) + \ Adding an explicit cast to __u64 avoids that warning and makes it easier to read other output. 
Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Nick Desaulniers <ndesaulniers@google.com> Link: http://lkml.kernel.org/r/20200505135513.65265-1-arnd@arndb.de Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2023-05-16 08:20:00 +09:00
Pawel Laszczak	f1299d047d	usb: gadget: Fix issue with config_ep_by_speed function [ Upstream commit `5d363120aa` ] This patch adds new config_ep_by_speed_and_alt function which extends the config_ep_by_speed about alt parameter. This additional parameter allows to find proper usb_ss_ep_comp_descriptor. Problem has appeared during testing f_tcm (BOT/UAS) driver function. f_tcm function for SS use array of headers for both BOT/UAS alternate setting: static struct usb_descriptor_header uasp_ss_function_desc[] = { (struct usb_descriptor_header ) &bot_intf_desc, (struct usb_descriptor_header ) &uasp_ss_bi_desc, (struct usb_descriptor_header ) &bot_bi_ep_comp_desc, (struct usb_descriptor_header ) &uasp_ss_bo_desc, (struct usb_descriptor_header ) &bot_bo_ep_comp_desc, (struct usb_descriptor_header ) &uasp_intf_desc, (struct usb_descriptor_header ) &uasp_ss_bi_desc, (struct usb_descriptor_header ) &uasp_bi_ep_comp_desc, (struct usb_descriptor_header ) &uasp_bi_pipe_desc, (struct usb_descriptor_header ) &uasp_ss_bo_desc, (struct usb_descriptor_header ) &uasp_bo_ep_comp_desc, (struct usb_descriptor_header ) &uasp_bo_pipe_desc, (struct usb_descriptor_header ) &uasp_ss_status_desc, (struct usb_descriptor_header ) &uasp_status_in_ep_comp_desc, (struct usb_descriptor_header ) &uasp_status_pipe_desc, (struct usb_descriptor_header ) &uasp_ss_cmd_desc, (struct usb_descriptor_header ) &uasp_cmd_comp_desc, (struct usb_descriptor_header ) &uasp_cmd_pipe_desc, NULL, }; The first 5 descriptors are associated with BOT alternate setting, and others are associated with UAS. During handling UAS alternate setting f_tcm driver invokes config_ep_by_speed and this function sets incorrect companion endpoint descriptor in usb_ep object. Instead setting ep->comp_desc to uasp_bi_ep_comp_desc function in this case set ep->comp_desc to uasp_ss_bi_desc. 
This is due to the fact that it searches endpoint based on endpoint address: for_each_ep_desc(speed_desc, d_spd) { chosen_desc = (struct usb_endpoint_descriptor *)*d_spd; if (chosen_desc->bEndpointAddress == _ep->address) goto ep_found; } And in result it uses the descriptor from BOT alternate setting instead UAS. Finally, it causes that controller driver during enabling endpoints detect that just enabled endpoint for bot. Signed-off-by: Jayshri Pawar <jpawar@cadence.com> Signed-off-by: Pawel Laszczak <pawell@cadence.com> Signed-off-by: Felipe Balbi <balbi@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2023-05-16 08:19:44 +09:00
NeilBrown	f0016cc6da	sunrpc: clean up properly in gss_mech_unregister() commit `24c5efe41c` upstream. gss_mech_register() calls svcauth_gss_register_pseudoflavor() for each flavour, but gss_mech_unregister() does not call auth_domain_put(). This is unbalanced and makes it impossible to reload the module. Change svcauth_gss_register_pseudoflavor() to return the registered auth_domain, and save it for later release. Cc: stable@vger.kernel.org (v2.6.12+) Link: https://bugzilla.kernel.org/show_bug.cgi?id=206651 Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2023-05-15 17:34:44 +09:00
Daniel Thompson	0a739b71ae	kgdb: Fix spurious true from in_dbg_master() [ Upstream commit `3fec4aecb3` ] Currently there is a small window where a badly timed migration could cause in_dbg_master() to spuriously return true. Specifically if we migrate to a new core after reading the processor id and the previous core takes a breakpoint then we will evaluate true if we read kgdb_active before we get the IPI to bring us to halt. Fix this by checking irqs_disabled() first. Interrupts are always disabled when we are executing the kgdb trap so this is an acceptable prerequisite. This also allows us to replace raw_smp_processor_id() with smp_processor_id() since the short circuit logic will prevent warnings from PREEMPT_DEBUG. Fixes: `dcc7871128` ("kgdb: core changes to support kdb") Suggested-by: Will Deacon <will@kernel.org> Link: https://lore.kernel.org/r/20200506164223.2875760-1-daniel.thompson@linaro.org Reviewed-by: Douglas Anderson <dianders@chromium.org> Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2023-05-15 17:33:45 +09:00
Mark Gross	cd078ae2cd	x86/cpu: Add a steppings field to struct x86_cpu_id commit `e9d7144597` upstream Intel uses the same family/model for several CPUs. Sometimes the stepping must be checked to tell them apart. On x86 there can be at most 16 steppings. Add a steppings bitmask to x86_cpu_id and a X86_MATCH_VENDOR_FAMILY_MODEL_STEPPING_FEATURE macro and support for matching against family/model/stepping. [ bp: Massage. tglx: Lightweight variant for backporting ] Signed-off-by: Mark Gross <mgross@linux.intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Tony Luck <tony.luck@intel.com> Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2023-05-15 17:31:35 +09:00
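The stepping match described in the commit above can be sketched in user-space C. This is only an illustration under stated assumptions, not the kernel's code: `struct cpu_id`, `STEPPING_ANY`, `stepping_range()` and `cpu_id_matches()` are hypothetical names, and the convention that an empty mask matches every stepping is assumed for this sketch. Only the idea of a 16-bit mask with one bit per stepping comes from the commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a CPU match-table entry (not struct x86_cpu_id). */
struct cpu_id {
    uint16_t family;
    uint16_t model;
    uint16_t steppings;   /* bit n set => stepping n matches */
};

/* Assumption for this sketch: an empty mask matches every stepping. */
#define STEPPING_ANY 0

/* Build a mask covering steppings lo..hi inclusive. */
static uint16_t stepping_range(unsigned lo, unsigned hi)
{
    assert(lo <= hi && hi <= 15);   /* at most 16 steppings exist */
    uint16_t upto_hi = (uint16_t)((1u << (hi + 1)) - 1u);
    uint16_t below_lo = (uint16_t)((1u << lo) - 1u);
    return (uint16_t)(upto_hi & ~below_lo);
}

/* Return true if (family, model, stepping) matches the table entry. */
static bool cpu_id_matches(const struct cpu_id *m, uint16_t family,
                           uint16_t model, unsigned stepping)
{
    if (m->family != family || m->model != model)
        return false;
    if (m->steppings == STEPPING_ANY)
        return true;
    return stepping < 16 && (m->steppings & (1u << stepping)) != 0;
}
```

A table entry built with `stepping_range(2, 4)` would then match steppings 2, 3 and 4 of the given family/model and reject the rest.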
Pablo Neira Ayuso	2eb5f02ba5	netfilter: nf_conntrack_pptp: fix compilation warning with W=1 build commit `4946ea5c12` upstream. >> include/linux/netfilter/nf_conntrack_pptp.h:13:20: warning: 'const' type qualifier on return type has no effect [-Wignored-qualifiers] extern const char *const pptp_msg_name(u_int16_t msg); ^~~~~~ Reported-by: kbuild test robot <lkp@intel.com> Fixes: `4c559f15ef` ("netfilter: nf_conntrack_pptp: prevent buffer overflows in debug code") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2023-05-15 17:30:41 +09:00
Pablo Neira Ayuso	79ce1a0f3c	netfilter: nf_conntrack_pptp: prevent buffer overflows in debug code commit `4c559f15ef` upstream. Dan Carpenter says: "Smatch complains that the value for "cmd" comes from the network and can't be trusted." Add pptp_msg_name() helper function that checks for the array boundary. Fixes: `f09943fefe` ("[NETFILTER]: nf_conntrack/nf_nat: add PPTP helper port") Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2023-05-15 17:30:37 +09:00
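A bounds-checked name lookup of the kind this commit describes might look roughly like the following user-space sketch. The table contents, `MSG_MAX` and `msg_name()` are hypothetical and chosen for illustration; the kernel's actual table and helper are not reproduced here.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical name table; entry 0 doubles as the "unknown" fallback. */
static const char *const msg_names[] = {
    "UNKNOWN_MESSAGE",
    "START_REQUEST",
    "START_REPLY",
    "STOP_REQUEST",
};

/* Highest valid index into msg_names. */
#define MSG_MAX (sizeof(msg_names) / sizeof(msg_names[0]) - 1)

static_assert(MSG_MAX < UINT16_MAX, "table must fit the 16-bit message id");

/* Map an untrusted, network-supplied id to a printable name. */
static const char *msg_name(uint16_t msg)
{
    if (msg > MSG_MAX)
        return msg_names[0];   /* out of range: never index past the table */
    return msg_names[msg];
}
```

Note that the sketch returns `const char *` rather than `const char *const`: a `const` qualifier on the returned pointer itself has no effect, which is what the `-Wignored-qualifiers` warning in the follow-up commit complains about.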
Konstantin Khlebnikov	c24ad38aaa	mm: remove VM_BUG_ON(PageSlab()) from page_mapcount() [ Upstream commit `6988f31d55` ] Replace superfluous VM_BUG_ON() with comment about correct usage. Technically reverts commit `1d148e218a` ("mm: add VM_BUG_ON_PAGE() to page_mapcount()"), but context lines have changed. Function isolate_migratepages_block() runs some checks out of lru_lock when choose pages for migration. After checking PageLRU() it checks extra page references by comparing page_count() and page_mapcount(). Between these two checks page could be removed from lru, freed and taken by slab. As a result this race triggers VM_BUG_ON(PageSlab()) in page_mapcount(). Race window is tiny. For certain workload this happens around once a year. page:ffffea0105ca9380 count:1 mapcount:0 mapping:ffff88ff7712c180 index:0x0 compound_mapcount: 0 flags: 0x500000000008100(slab\|head) raw: 0500000000008100 dead000000000100 dead000000000200 ffff88ff7712c180 raw: 0000000000000000 0000000080200020 00000001ffffffff 0000000000000000 page dumped because: VM_BUG_ON_PAGE(PageSlab(page)) ------------[ cut here ]------------ kernel BUG at ./include/linux/mm.h:628! invalid opcode: 0000 [#1] SMP NOPTI CPU: 77 PID: 504 Comm: kcompactd1 Tainted: G W 4.19.109-27 #1 Hardware name: Yandex T175-N41-Y3N/MY81-EX0-Y3N, BIOS R05 06/20/2019 RIP: 0010:isolate_migratepages_block+0x986/0x9b0 The code in isolate_migratepages_block() was added in commit `119d6d59dc` ("mm, compaction: avoid isolating pinned pages") before adding VM_BUG_ON into page_mapcount(). This race has been predicted in 2015 by Vlastimil Babka (see link below). [akpm@linux-foundation.org: comment tweaks, per Hugh] Fixes: `1d148e218a` ("mm: add VM_BUG_ON_PAGE() to page_mapcount()") Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Hugh Dickins <hughd@google.com> Acked-by: Kirill A. 
Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: <stable@vger.kernel.org> Link: http://lkml.kernel.org/r/159032779896.957378.7852761411265662220.stgit@buzz Link: https://lore.kernel.org/lkml/557710E1.6060103@suse.cz/ Link: https://lore.kernel.org/linux-mm/158937872515.474360.5066096871639561424.stgit@buzz/T/ (v1) Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2023-05-15 17:30:18 +09:00
Moshe Shemesh	e6d640c2c5	net/mlx5: Add command entry handling completion [ Upstream commit `17d00e839d` ] When FW response to commands is very slow and all command entries in use are waiting for completion we can have a race where commands can get timeout before they get out of the queue and handled. Timeout completion on uninitialized command will cause releasing command's buffers before accessing it for initialization and then we will get NULL pointer exception while trying access it. It may also cause releasing buffers of another command since we may have timeout completion before even allocating entry index for this command. Add entry handling completion to avoid this race. Fixes: `e126ba97db` ("mlx5: Add driver for Mellanox Connect-IB adapters") Signed-off-by: Moshe Shemesh <moshe@mellanox.com> Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2023-05-15 17:29:47 +09:00
R. Parameswaran	8b44f64d82	l2tp: device MTU setup, tunnel socket needs a lock commit `57240d0078` upstream. The MTU overhead calculation in L2TP device set-up merged via commit `b784e7ebfc` needs to be adjusted to lock the tunnel socket while referencing the sub-data structures to derive the socket's IP overhead. Reported-by: Guillaume Nault <g.nault@alphalink.fr> Tested-by: Guillaume Nault <g.nault@alphalink.fr> Signed-off-by: R. Parameswaran <rparames@brocade.com> Signed-off-by: David S. Miller <davem@davemloft.net> Cc: Giuliano Procida <gprocida@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2023-05-15 17:29:18 +09:00
R. Parameswaran	d7f250f770	New kernel function to get IP overhead on a socket. commit `113c307593` upstream. A new function, kernel_sock_ip_overhead(), is provided to calculate the cumulative overhead imposed by the IP Header and IP options, if any, on a socket's payload. The new function returns an overhead of zero for sockets that do not belong to the IPv4 or IPv6 address families. This is used in the L2TP code path to compute the total outer IP overhead on the L2TP tunnel socket when calculating the default MTU for Ethernet pseudowires. Signed-off-by: R. Parameswaran <rparames@brocade.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Giuliano Procida <gprocida@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2023-05-15 17:28:56 +09:00
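The behaviour described above (header plus options for IPv4/IPv6, zero for every other family) can be approximated in user space as follows. This is a simplified sketch, not the kernel function: `ip_overhead()` is a hypothetical name, and the caller is assumed to already know the length of any IP options or extension headers, which the kernel instead reads from the socket.

```c
#include <stddef.h>
#include <sys/socket.h>   /* AF_INET, AF_INET6 */

/* Fixed header sizes in bytes (IPv4 without options, IPv6 base header). */
enum { IPV4_HDR_LEN = 20, IPV6_HDR_LEN = 40 };

/*
 * Return the per-packet IP overhead for a socket of the given family.
 * opt_len is the caller-supplied length of IP options / extension headers
 * (an assumption of this sketch).
 */
static size_t ip_overhead(int family, size_t opt_len)
{
    switch (family) {
    case AF_INET:
        return IPV4_HDR_LEN + opt_len;
    case AF_INET6:
        return IPV6_HDR_LEN + opt_len;
    default:
        return 0;   /* not an IP socket: no IP overhead */
    }
}
```

A tunnel's default MTU would then be derived by subtracting this value, together with the tunnel's own encapsulation overhead, from the underlying link MTU.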
Herbert Xu	a096ae17d6	padata: Replace delayed timer with immediate workqueue in padata_reorder [ Upstream commit `6fc4dbcf02` ] The function padata_reorder will use a timer when it cannot progress while completed jobs are outstanding (pd->reorder_objects > 0). This is suboptimal as if we do end up using the timer then it would have introduced a gratuitous delay of one second. In fact we can easily distinguish between whether completed jobs are outstanding and whether we can make progress. All we have to do is look at the next pqueue list. This patch does that by replacing pd->processed with pd->cpu so that the next pqueue is more accessible. A work queue is used instead of the original try_again to avoid hogging the CPU. Note that we don't bother removing the work queue in padata_flush_queues because the whole premise is broken. You cannot flush async crypto requests so it makes no sense to even try. A subsequent patch will fix it by replacing it with a ref counting scheme. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> [dj: - adjust context - corrected setup_timer -> timer_setup to delete hunk - skip padata_flush_queues() hunk, function already removed in 4.9] Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com> Signed-off-by: Sasha Levin <sashal@kernel.org>	2023-05-15 17:28:48 +09:00
Mathias Krause	e794421930	padata: ensure padata_do_serial() runs on the correct CPU commit `350ef88e7e` upstream. If the algorithm we're parallelizing is asynchronous we might change CPUs between padata_do_parallel() and padata_do_serial(). However, we don't expect this to happen as we need to enqueue the padata object into the per-cpu reorder queue we took it from, i.e. the same-cpu's parallel queue. Ensure we're not switching CPUs for a given padata object by tracking the CPU within the padata object. If the serial callback gets called on the wrong CPU, defer invoking padata_reorder() via a kernel worker on the CPU we're expected to run on. Signed-off-by: Mathias Krause <minipli@googlemail.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Cc: Ben Hutchings <ben@decadent.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2023-05-15 17:28:29 +09:00
Mathias Krause	85a4a07617	padata: ensure the reorder timer callback runs on the correct CPU commit `cf5868c8a2` upstream. The reorder timer function runs on the CPU where the timer interrupt was handled which is not necessarily one of the CPUs of the 'pcpu' CPU mask set. Ensure the padata_reorder() callback runs on the correct CPU, which is one in the 'pcpu' CPU mask set and, preferably, the next expected one. Do so by comparing the current CPU with the expected target CPU. If they match, call padata_reorder() right away. If they differ, schedule a work item on the target CPU that does the padata_reorder() call for us. Signed-off-by: Mathias Krause <minipli@googlemail.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Cc: Ben Hutchings <ben@decadent.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2023-05-15 17:28:28 +09:00
secuflag	a1c0c331bb	arm64: Fix build with clang arch/arm64/kernel/cpu_errata.c:274:3: error: unknown register name 'r0' in asm arm_smccc_1_1_hvc(ARM_SMCCC_ARCH_WORKAROUND_2, state, NULL); ^ ./include/linux/arm-smccc.h:309:32: note: expanded from macro 'arm_smccc_1_1_hvc' ^ ./include/linux/arm-smccc.h:272:3: note: expanded from macro '__arm_smccc_1_1' __declare_args(__count_args(__VA_ARGS__), __VA_ARGS__); \ ^ ./include/linux/arm-smccc.h:257:37: note: expanded from macro '__declare_args' ^ ./include/linux/arm-smccc.h:256:37: note: expanded from macro '___declare_args' ^ <scratch space>:8:1: note: expanded from here __declare_arg_1 ^ ./include/linux/arm-smccc.h:212:32: note: expanded from macro '__declare_arg_1' register unsigned long r0 asm("r0") = (u32)a0; \ ^ arch/arm64/kernel/cpu_errata.c:274:3: error: unknown register name 'r1' in asm ./include/linux/arm-smccc.h:309:32: note: expanded from macro 'arm_smccc_1_1_hvc' ^ ./include/linux/arm-smccc.h:272:3: note: expanded from macro '__arm_smccc_1_1' __declare_args(__count_args(__VA_ARGS__), __VA_ARGS__); \ ^ ./include/linux/arm-smccc.h:257:37: note: expanded from macro '__declare_args' ^ ./include/linux/arm-smccc.h:256:37: note: expanded from macro '___declare_args' ^ <scratch space>:8:1: note: expanded from here __declare_arg_1 ^ ./include/linux/arm-smccc.h:213:32: note: expanded from macro '__declare_arg_1' register unsigned long r1 asm("r1") = __a1; \ ^ arch/arm64/kernel/cpu_errata.c:278:3: error: unknown register name 'r0' in asm arm_smccc_1_1_smc(ARM_SMCCC_ARCH_WORKAROUND_2, state, NULL); ^ ./include/linux/arm-smccc.h:293:32: note: expanded from macro 'arm_smccc_1_1_smc' ^ ./include/linux/arm-smccc.h:272:3: note: expanded from macro '__arm_smccc_1_1' __declare_args(__count_args(__VA_ARGS__), __VA_ARGS__); \ ^ ./include/linux/arm-smccc.h:257:37: note: expanded from macro '__declare_args' ^ ./include/linux/arm-smccc.h:256:37: note: expanded from macro '___declare_args' ^ <scratch space>:14:1: note: expanded from 
here __declare_arg_1 ^ ./include/linux/arm-smccc.h:212:32: note: expanded from macro '__declare_arg_1' register unsigned long r0 asm("r0") = (u32)a0; \ ^ arch/arm64/kernel/cpu_errata.c:278:3: error: unknown register name 'r1' in asm ./include/linux/arm-smccc.h:293:32: note: expanded from macro 'arm_smccc_1_1_smc' ^ ./include/linux/arm-smccc.h:272:3: note: expanded from macro '__arm_smccc_1_1' __declare_args(__count_args(__VA_ARGS__), __VA_ARGS__); \ ^ ./include/linux/arm-smccc.h:257:37: note: expanded from macro '__declare_args' ^ ./include/linux/arm-smccc.h:256:37: note: expanded from macro '___declare_args' ^ <scratch space>:14:1: note: expanded from here __declare_arg_1 ^ ./include/linux/arm-smccc.h:213:32: note: expanded from macro '__declare_arg_1' register unsigned long r1 asm("r1") = __a1; \ ^ arch/arm64/kernel/cpu_errata.c:303:3: error: unknown register name 'r0' in asm arm_smccc_1_1_hvc(ARM_SMCCC_ARCH_FEATURES_FUNC_ID, ^ ./include/linux/arm-smccc.h:309:32: note: expanded from macro 'arm_smccc_1_1_hvc' ^ ./include/linux/arm-smccc.h:272:3: note: expanded from macro '__arm_smccc_1_1' __declare_args(__count_args(__VA_ARGS__), __VA_ARGS__); \ ^ ./include/linux/arm-smccc.h:257:37: note: expanded from macro '__declare_args' ^ ./include/linux/arm-smccc.h:256:37: note: expanded from macro '___declare_args' ^ <scratch space>:30:1: note: expanded from here __declare_arg_1 ^ ./include/linux/arm-smccc.h:212:32: note: expanded from macro '__declare_arg_1' register unsigned long r0 asm("r0") = (u32)a0; \ ^ arch/arm64/kernel/cpu_errata.c:303:3: error: unknown register name 'r1' in asm ./include/linux/arm-smccc.h:309:32: note: expanded from macro 'arm_smccc_1_1_hvc' ^ ./include/linux/arm-smccc.h:272:3: note: expanded from macro '__arm_smccc_1_1' __declare_args(__count_args(__VA_ARGS__), __VA_ARGS__); \ ^ ./include/linux/arm-smccc.h:257:37: note: expanded from macro '__declare_args' ^ ./include/linux/arm-smccc.h:256:37: note: expanded from macro '___declare_args' ^ <scratch 
space>:30:1: note: expanded from here __declare_arg_1 ^ ./include/linux/arm-smccc.h:213:32: note: expanded from macro '__declare_arg_1' register unsigned long r1 asm("r1") = __a1; \ ^ arch/arm64/kernel/cpu_errata.c:308:3: error: unknown register name 'r0' in asm arm_smccc_1_1_smc(ARM_SMCCC_ARCH_FEATURES_FUNC_ID, ^ ./include/linux/arm-smccc.h:293:32: note: expanded from macro 'arm_smccc_1_1_smc' ^ ./include/linux/arm-smccc.h:272:3: note: expanded from macro '__arm_smccc_1_1' __declare_args(__count_args(__VA_ARGS__), __VA_ARGS__); \ ^ ./include/linux/arm-smccc.h:257:37: note: expanded from macro '__declare_args' ^ ./include/linux/arm-smccc.h:256:37: note: expanded from macro '___declare_args' ^ <scratch space>:37:1: note: expanded from here __declare_arg_1 ^ ./include/linux/arm-smccc.h:212:32: note: expanded from macro '__declare_arg_1' register unsigned long r0 asm("r0") = (u32)a0; \ ^ arch/arm64/kernel/cpu_errata.c:308:3: error: unknown register name 'r1' in asm ./include/linux/arm-smccc.h:293:32: note: expanded from macro 'arm_smccc_1_1_smc' ^ ./include/linux/arm-smccc.h:272:3: note: expanded from macro '__arm_smccc_1_1' __declare_args(__count_args(__VA_ARGS__), __VA_ARGS__); \ ^ ./include/linux/arm-smccc.h:257:37: note: expanded from macro '__declare_args' ^ ./include/linux/arm-smccc.h:256:37: note: expanded from macro '___declare_args' ^ <scratch space>:37:1: note: expanded from here __declare_arg_1 ^ ./include/linux/arm-smccc.h:213:32: note: expanded from macro '__declare_arg_1' register unsigned long r1 asm("r1") = __a1; \	2023-05-15 17:28:23 +09:00
Borislav Petkov	31cad0d2b7	x86: Fix early boot crash on gcc-10, third try commit `a9a3ed1eff` upstream. ... or the odyssey of trying to disable the stack protector for the function which generates the stack canary value. The whole story started with Sergei reporting a boot crash with a kernel built with gcc-10: Kernel panic - not syncing: stack-protector: Kernel stack is corrupted in: start_secondary CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.6.0-rc5-00235-gfffb08b37df9 #139 Hardware name: Gigabyte Technology Co., Ltd. To be filled by O.E.M./H77M-D3H, BIOS F12 11/14/2013 Call Trace: dump_stack panic ? start_secondary __stack_chk_fail start_secondary secondary_startup_64 ---[ end Kernel panic - not syncing: stack-protector: Kernel stack is corrupted in: start_secondary This happens because gcc-10 tail-call optimizes the last function call in start_secondary() - cpu_startup_entry() - and thus emits a stack canary check which fails because the canary value changes after the boot_init_stack_canary() call. To fix that, the initial attempt was to mark the one function which generates the stack canary with: __attribute__((optimize("-fno-stack-protector"))) ... start_secondary(void *unused) however, using the optimize attribute doesn't work cumulatively as the attribute does not add to but rather replaces previously supplied optimization options - roughly all -fxxx options. The key one among them being -fno-omit-frame-pointer and thus leading to not present frame pointer - frame pointer which the kernel needs. The next attempt to prevent compilers from tail-call optimizing the last function call cpu_startup_entry(), shy of carving out start_secondary() into a separate compilation unit and building it with -fno-stack-protector, was to add an empty asm(""). This current solution was short and sweet, and reportedly, is supported by both compilers but we didn't get very far this time: future (LTO?) 
optimization passes could potentially eliminate this, which leads us to the third attempt: having an actual memory barrier there which the compiler cannot ignore or move around etc. That should hold for a long time, but hey we said that about the other two solutions too so... Reported-by: Sergei Trofimovich <slyfox@gentoo.org> Signed-off-by: Borislav Petkov <bp@suse.de> Tested-by: Kalle Valo <kvalo@codeaurora.org> Cc: <stable@vger.kernel.org> Link: https://lkml.kernel.org/r/20200314164451.346497-1-slyfox@gentoo.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2023-05-15 17:28:09 +09:00
Linus Torvalds	97b25a15da	gcc-10 warnings: fix low-hanging fruit commit `9d82973e03` upstream. Due to a bug-report that was compiler-dependent, I updated one of my machines to gcc-10. That shows a lot of new warnings. Happily they seem to be mostly the valid kind, but it's going to cause a round of churn for getting rid of them.. This is the really low-hanging fruit of removing a couple of zero-sized arrays in some core code. We have had a round of these patches before, and we'll have many more coming, and there is nothing special about these except that they were particularly trivial, and triggered more warnings than most. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2023-05-15 17:27:48 +09:00
Jason Gunthorpe	e795998f7c	pnp: Use list_for_each_entry() instead of open coding commit `01b2bafe57` upstream. Aside from good practice, this avoids a warning from gcc 10: ./include/linux/kernel.h:997:3: warning: array subscript -31 is outside array bounds of ‘struct list_head[1]’ [-Warray-bounds] 997 \| ((type *)(__mptr - offsetof(type, member))); }) \| ~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ./include/linux/list.h:493:2: note: in expansion of macro ‘container_of’ 493 \| container_of(ptr, type, member) \| ^~~~~~~~~~~~ ./include/linux/pnp.h:275:30: note: in expansion of macro ‘list_entry’ 275 \| #define global_to_pnp_dev(n) list_entry(n, struct pnp_dev, global_list) \| ^~~~~~~~~~ ./include/linux/pnp.h:281:11: note: in expansion of macro ‘global_to_pnp_dev’ 281 \| (dev) != global_to_pnp_dev(&pnp_global); \ \| ^~~~~~~~~~~~~~~~~ arch/x86/kernel/rtc.c:189:2: note: in expansion of macro ‘pnp_for_each_dev’ 189 \| pnp_for_each_dev(dev) { Because the common code doesn't cast the starting list_head to the containing struct. Signed-off-by: Jason Gunthorpe <jgg@mellanox.com> [ rjw: Whitespace adjustments ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2023-05-15 17:27:47 +09:00
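The difference the commit relies on can be seen in a small user-space re-creation of the list helpers. This is a sketch modelled on the kernel's `list.h`, not a copy of it; `struct dev`, `list_init()` and `sum_ids()` are hypothetical names, and `__typeof__` is a GCC/Clang extension. The loop terminates by comparing `&pos->member` with the bare list head, so the head itself is never treated as if it were embedded in a containing struct.

```c
#include <stddef.h>

struct list_head { struct list_head *next, *prev; };

/* Recover the containing struct from a pointer to its embedded member. */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Iterate over entries; the end test compares list_head pointers only. */
#define list_for_each_entry(pos, head, member)                          \
    for (pos = container_of((head)->next, __typeof__(*pos), member);    \
         &pos->member != (head);                                        \
         pos = container_of(pos->member.next, __typeof__(*pos), member))

/* Make an empty circular list. */
static void list_init(struct list_head *h) { h->next = h->prev = h; }

/* Link node n in just before head, i.e. at the tail of the list. */
static void list_add_tail(struct list_head *n, struct list_head *head)
{
    n->prev = head->prev;
    n->next = head;
    head->prev->next = n;
    head->prev = n;
}

/* Hypothetical device type with an embedded list node. */
struct dev { int id; struct list_head node; };

/* Sum the ids of all devices on the list. */
static int sum_ids(struct list_head *head)
{
    struct dev *d;
    int sum = 0;

    list_for_each_entry(d, head, node)
        sum += d->id;
    return sum;
}
```

An open-coded loop that instead compares `dev != container_of(head, ...)` casts the standalone head to the containing type, which is the pattern gcc 10 flags with `-Warray-bounds`.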
Vladis Dronov	fd1fe48249	ptp: fix the race between the release of ptp_clock and cdev commit `a33121e548` upstream. In a case when a ptp chardev (like /dev/ptp0) is open but an underlying device is removed, closing this file leads to a race. This reproduces easily in a kvm virtual machine: ts# cat openptp0.c int main() { ... fp = fopen("/dev/ptp0", "r"); ... sleep(10); } ts# uname -r 5.5.0-rc3-46cf053e ts# cat /proc/cmdline ... slub_debug=FZP ts# modprobe ptp_kvm ts# ./openptp0 & [1] 670 opened /dev/ptp0, sleeping 10s... ts# rmmod ptp_kvm ts# ls /dev/ptp* ls: cannot access '/dev/ptp*': No such file or directory ts# ...woken up [ 48.010809] general protection fault: 0000 [#1] SMP [ 48.012502] CPU: 6 PID: 658 Comm: openptp0 Not tainted 5.5.0-rc3-46cf053e #25 [ 48.014624] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), ... [ 48.016270] RIP: 0010:module_put.part.0+0x7/0x80 [ 48.017939] RSP: 0018:ffffb3850073be00 EFLAGS: 00010202 [ 48.018339] RAX: 000000006b6b6b6b RBX: 6b6b6b6b6b6b6b6b RCX: ffff89a476c00ad0 [ 48.018936] RDX: fffff65a08d3ea08 RSI: 0000000000000247 RDI: 6b6b6b6b6b6b6b6b [ 48.019470] ... ^^^ a slub poison [ 48.023854] Call Trace: [ 48.024050] __fput+0x21f/0x240 [ 48.024288] task_work_run+0x79/0x90 [ 48.024555] do_exit+0x2af/0xab0 [ 48.024799] ? vfs_write+0x16a/0x190 [ 48.025082] do_group_exit+0x35/0x90 [ 48.025387] __x64_sys_exit_group+0xf/0x10 [ 48.025737] do_syscall_64+0x3d/0x130 [ 48.026056] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [ 48.026479] RIP: 0033:0x7f53b12082f6 [ 48.026792] ... [ 48.030945] Modules linked in: ptp i6300esb watchdog [last unloaded: ptp_kvm] [ 48.045001] Fixing recursive fault but reboot is needed! This happens in: static void __fput(struct file *file) { ... 
if (file->f_op->release) file->f_op->release(inode, file); <<< cdev is kfree'd here if (unlikely(S_ISCHR(inode->i_mode) && inode->i_cdev != NULL && !(mode & FMODE_PATH))) { cdev_put(inode->i_cdev); <<< cdev fields are accessed here Namely: __fput() posix_clock_release() kref_put(&clk->kref, delete_clock) <<< the last reference delete_clock() delete_ptp_clock() kfree(ptp) <<< cdev is embedded in ptp cdev_put module_put(p->owner) <<< *p is kfree'd, bang! Here cdev is embedded in posix_clock which is embedded in ptp_clock. The race happens because ptp_clock's lifetime is controlled by two refcounts: kref and cdev.kobj in posix_clock. This is wrong. Make ptp_clock's sysfs device a parent of cdev with cdev_device_add() created especially for such cases. This way the parent device with its ptp_clock is not released until all references to the cdev are released. This adds a requirement that an initialized but not exposed struct device should be provided to posix_clock_register() by a caller instead of a simple dev_t. This approach was adopted from the commit `72139dfa24` ("watchdog: Fix the race between the release of watchdog_core_data and cdev"). See details of the implementation in the commit `233ed09d7f` ("chardev: add helper function to register char devs with a struct device"). Link: https://lore.kernel.org/linux-fsdevel/20191125125342.6189-1-vdronov@redhat.com/T/#u Analyzed-by: Stephen Johnston <sjohnsto@redhat.com> Analyzed-by: Vern Lovejoy <vlovejoy@redhat.com> Signed-off-by: Vladis Dronov <vdronov@redhat.com> Acked-by: Richard Cochran <richardcochran@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by: Sasha Levin <sashal@kernel.org>	2023-05-15 17:27:24 +09:00
Logan Gunthorpe	010ab6af87	chardev: add helper function to register char devs with a struct device commit `233ed09d7f` upstream. Credit for this patch is shared with Dan Williams [1]. I've taken things one step further to make the helper function more useful and clean up calling code. There's a common pattern in the kernel whereby a struct cdev is placed in a structure alongside a struct device which manages the life-cycle of both. In the naive approach, the reference counting is broken and the struct device can free everything before the chardev code is entirely released. Many developers have solved this problem by linking the internal kobjs in this fashion: cdev.kobj.parent = &parent_dev.kobj; The cdev code explicitly gets and puts a reference to its kobj parent. So this seems like it was intended to be used this way. Dmitry Torokhov first put this in place in 2012 with this commit: `2f0157f` char_dev: pin parent kobject and the first instance of the fix was then done in the input subsystem in the following commit: `4a215aa` Input: fix use-after-free introduced with dynamic minor changes Subsequently over the years, however, this issue seems to have tripped up multiple developers independently. For example, see these commits: `0d5b7da` iio: Prevent race between IIO chardev opening and IIO device (by Lars-Peter Clausen in 2013) `ba0ef85` tpm: Fix initialization of the cdev (by Jason Gunthorpe in 2015) `5b28dde` [media] media: fix use-after-free in cdev_put() when app exits after driver unbind (by Shuah Khan in 2016) This technique is similarly done in at least 15 places within the kernel and probably should have been done so in another, at least, 5 places. The kobj line also looks very suspect in that one would not expect drivers to have to mess with kobject internals in this way. Even highly experienced kernel developers can be surprised by this code, as seen in [2]. 
To help alleviate this situation, and hopefully prevent future wasted effort on this problem, this patch introduces a helper function to register a char device along with its parent struct device. This creates a more regular API for tying a char device to its parent without the developer having to set members in the underlying kobject. This patch introduces cdev_device_add and cdev_device_del, which replace a common pattern including setting the kobj parent, calling cdev_add and then calling device_add. It also introduces cdev_set_parent for the few cases that set the kobject parent without using device_add. [1] https://lkml.org/lkml/2017/2/13/700 [2] https://lkml.org/lkml/2017/2/10/370 Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Hans Verkuil <hans.verkuil@cisco.com> Reviewed-by: Alexandre Belloni <alexandre.belloni@free-electrons.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by: Sasha Levin <sashal@kernel.org>	2023-05-15 17:27:22 +09:00
Jan Kara	29df7dff20	blktrace: Protect q->blk_trace with RCU commit `c780e86dd4` upstream. KASAN is reporting that __blk_add_trace() has a use-after-free issue when accessing q->blk_trace. Indeed the switching of block tracing (and thus eventual freeing of q->blk_trace) is completely unsynchronized with the currently running tracing and thus it can happen that the blk_trace structure is being freed just while __blk_add_trace() works on it. Protect accesses to q->blk_trace by RCU during tracing and make sure we wait for the end of RCU grace period when shutting down tracing. Luckily that is a rare enough event that we can afford that. Note that postponing the freeing of blk_trace to an RCU callback should better be avoided as it could have unexpected user visible side-effects as debugfs files would be still existing for a short while after block tracing has been shut down. Link: https://bugzilla.kernel.org/show_bug.cgi?id=205711 CC: stable@vger.kernel.org Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Tested-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reported-by: Tristan Madani <tristmd@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@kernel.dk> [bwh: Backported to 4.9: adjust context] Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by: Sasha Levin <sashal@kernel.org>	2023-05-15 17:27:16 +09:00
Waiman Long	0555930589	blktrace: Fix potential deadlock between delete & sysfs ops commit `5acb3cc2c2` upstream. The lockdep code had reported the following unsafe locking scenario: CPU0 CPU1 ---- ---- lock(s_active#228); lock(&bdev->bd_mutex/1); lock(s_active#228); lock(&bdev->bd_mutex); *** DEADLOCK *** The deadlock may happen when one task (CPU1) is trying to delete a partition in a block device and another task (CPU0) is accessing tracing sysfs file (e.g. /sys/block/dm-1/trace/act_mask) in that partition. The s_active isn't an actual lock. It is a reference count (kn->count) on the sysfs (kernfs) file. Removal of a sysfs file, however, requires a wait until all the references are gone. The reference count is treated like a rwsem using lockdep instrumentation code. The fact that a thread is in the sysfs callback method or in the ioctl call means there is a reference to the opened sysfs or device file. That should prevent the underlying block structure from being removed. Instead of using bd_mutex in the block_device structure, a new blk_trace_mutex is now added to the request_queue structure to protect access to the blk_trace structure. Suggested-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Waiman Long <longman@redhat.com> Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Fix typo in patch subject line, and prune a comment detailing how the code used to work. Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by: Sasha Levin <sashal@kernel.org>	2023-05-15 17:27:13 +09:00
Thomas Pedersen	5c4bb06ea7	mac80211: add ieee80211_is_any_nullfunc() commit `30b2f0be23` upstream. commit `08a5bdde38` ("mac80211: consider QoS Null frames for STA_NULLFUNC_ACKED") Fixed a bug where we failed to take into account a nullfunc frame can be either non-QoS or QoS. It turns out there is at least one more bug in ieee80211_sta_tx_notify(), introduced in commit `7b6ddeaf27` ("mac80211: use QoS NDP for AP probing"), where we forgot to check for the QoS variant and so assumed the QoS nullfunc frame never went out Fix this by adding a helper ieee80211_is_any_nullfunc() which consolidates the check for non-QoS and QoS nullfunc frames. Replace existing compound conditionals and add a couple more missing checks for QoS variant. Signed-off-by: Thomas Pedersen <thomas@adapt-ip.com> Link: https://lore.kernel.org/r/20200114055940.18502-3-thomas@adapt-ip.com Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2023-05-15 17:26:42 +09:00
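The consolidated check described in the entry above can be sketched in plain C. This is a minimal illustration and not the kernel's code: the function and macro names are hypothetical, the constants are the 802.11 frame-control type and subtype values as commonly defined, and the field is taken in host byte order, whereas the kernel helper operates on a little-endian `__le16` value.

```c
#include <stdbool.h>
#include <stdint.h>

/* Frame-control masks and values (assumed here, host byte order). */
#define FC_FTYPE_MASK         0x000c
#define FC_STYPE_MASK         0x00f0
#define FC_FTYPE_DATA         0x0008
#define FC_STYPE_NULLFUNC     0x0040
#define FC_STYPE_QOS_NULLFUNC 0x00c0

/* True for a non-QoS null data frame. */
bool is_nullfunc(uint16_t fc)
{
    return (fc & (FC_FTYPE_MASK | FC_STYPE_MASK)) ==
           (FC_FTYPE_DATA | FC_STYPE_NULLFUNC);
}

/* True for a QoS null data frame. */
bool is_qos_nullfunc(uint16_t fc)
{
    return (fc & (FC_FTYPE_MASK | FC_STYPE_MASK)) ==
           (FC_FTYPE_DATA | FC_STYPE_QOS_NULLFUNC);
}

/* One helper instead of a compound conditional at every call site. */
bool is_any_nullfunc(uint16_t fc)
{
    return is_nullfunc(fc) || is_qos_nullfunc(fc);
}
```

Callers that previously tested only the non-QoS variant would miss the `0x00c8` case; routing every check through the combined helper removes that class of omission.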
Sean Christopherson	9a969bd589	KVM: Check validity of resolved slot when searching memslots commit `b6467ab142` upstream. Check that the resolved slot (somewhat confusingly named 'start') is a valid/allocated slot before doing the final comparison to see if the specified gfn resides in the associated slot. The resolved slot can be invalid if the binary search loop terminated because the search index was incremented beyond the number of used slots. This bug has existed since the binary search algorithm was introduced, but went unnoticed because KVM statically allocated memory for the max number of slots, i.e. the access would only be truly out-of-bounds if all possible slots were allocated and the specified gfn was less than the base of the lowest memslot. Commit `36947254e5` ("KVM: Dynamically size memslot array based on number of used slots") eliminated the "all possible slots allocated" condition and made the bug embarrassingly easy to hit. Fixes: `9c1a5d3878` ("kvm: optimize GFN to memslot lookup with large slots amount") Reported-by: syzbot+d889b59b2bb87d4047a2@syzkaller.appspotmail.com Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200408064059.8957-2-sean.j.christopherson@intel.com> Reviewed-by: Cornelia Huck <cohuck@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2023-05-15 17:23:33 +09:00
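The failure mode in the entry above can be illustrated with a small userspace sketch. It is not the KVM source: `struct memslot`, `find_slot`, and `example_slots` are hypothetical names, and the array is assumed to be sorted by descending `base_gfn`. The point is the `start < used_slots` test before the resolved index is dereferenced.

```c
#include <stdint.h>

struct memslot {
    uint64_t base_gfn;  /* first guest frame number covered */
    uint64_t npages;    /* number of frames covered */
};

/* Example data, sorted by descending base_gfn (assumed ordering). */
const struct memslot example_slots[3] = {
    { 100, 10 }, { 50, 10 }, { 10, 5 },
};

/* Returns the index of the slot containing gfn, or -1 if none does. */
int find_slot(const struct memslot *slots, int used_slots, uint64_t gfn)
{
    int start = 0, end = used_slots;

    /* Find the first slot whose base_gfn is <= gfn. */
    while (start < end) {
        int mid = start + (end - start) / 2;

        if (gfn >= slots[mid].base_gfn)
            end = mid;
        else
            start = mid + 1;
    }

    /*
     * If gfn is below every base, start == used_slots here, which is
     * one past the last valid slot. Check that before touching it.
     */
    if (start < used_slots &&
        gfn >= slots[start].base_gfn &&
        gfn < slots[start].base_gfn + slots[start].npages)
        return start;

    return -1;
}
```

With the example data, looking up frame 5 runs the search index to 3, so without the bounds test the final comparison would read past the used portion of the array.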
Jann Horn	c78aa09e33	vmalloc: fix remap_vmalloc_range() bounds checks commit `bdebd6a283` upstream. remap_vmalloc_range() has had various issues with the bounds checks it promises to perform ("This function checks that addr is a valid vmalloc'ed area, and that it is big enough to cover the vma") over time, e.g.: - not detecting pgoff<<PAGE_SHIFT overflow - not detecting (pgoff<<PAGE_SHIFT)+usize overflow - not checking whether addr and addr+(pgoff<<PAGE_SHIFT) are the same vmalloc allocation - comparing a potentially wildly out-of-bounds pointer with the end of the vmalloc region In particular, since commit `fc9702273e` ("bpf: Add mmap() support for BPF_MAP_TYPE_ARRAY"), unprivileged users can cause kernel null pointer dereferences by calling mmap() on a BPF map with a size that is bigger than the distance from the start of the BPF map to the end of the address space. This could theoretically be used as a kernel ASLR bypass, by using whether mmap() with a given offset oopses or returns an error code to perform a binary search over the possible address range. To allow remap_vmalloc_range_partial() to verify that addr and addr+(pgoff<<PAGE_SHIFT) are in the same vmalloc region, pass the offset to remap_vmalloc_range_partial() instead of adding it to the pointer in remap_vmalloc_range(). In remap_vmalloc_range_partial(), fix the check against get_vm_area_size() by using size comparisons instead of pointer comparisons, and add checks for pgoff. Fixes: `833423143c` ("[PATCH] mm: introduce remap_vmalloc_range()") Signed-off-by: Jann Horn <jannh@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: stable@vger.kernel.org Cc: Alexei Starovoitov <ast@kernel.org> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Martin KaFai Lau <kafai@fb.com> Cc: Song Liu <songliubraving@fb.com> Cc: Yonghong Song <yhs@fb.com> Cc: Andrii Nakryiko <andriin@fb.com> Cc: John Fastabend <john.fastabend@gmail.com> Cc: KP Singh <kpsingh@chromium.org> Link: http://lkml.kernel.org/r/20200415222312.236431-1-jannh@google.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2023-05-15 17:23:21 +09:00
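The kind of check this entry describes, comparing sizes rather than pointers and rejecting both overflows, might look like the following userspace sketch. `range_fits` and `SKETCH_PAGE_SHIFT` are hypothetical names, a 4 KiB page is assumed, and the kernel's own overflow helpers are replaced here by explicit comparisons against `SIZE_MAX`.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12  /* assumed 4 KiB pages */

/*
 * Returns true if [pgoff << shift, (pgoff << shift) + usize) lies
 * entirely inside an allocation of area_size bytes.
 */
bool range_fits(size_t area_size, size_t pgoff, size_t usize)
{
    size_t off, end;

    /* Would pgoff << SKETCH_PAGE_SHIFT overflow? */
    if (pgoff > (SIZE_MAX >> SKETCH_PAGE_SHIFT))
        return false;
    off = pgoff << SKETCH_PAGE_SHIFT;

    /* Would off + usize overflow? */
    if (usize > SIZE_MAX - off)
        return false;
    end = off + usize;

    /* Size comparison only; no out-of-bounds pointer is ever formed. */
    return end <= area_size;
}
```

Because every comparison is on byte counts relative to the start of the allocation, a huge offset fails the first or second test instead of producing a wrapped pointer that happens to compare as in range.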
Jason Gunthorpe	8e3d536e93	overflow.h: Add arithmetic shift helper commit `0c66847793` upstream. Add shift_overflow() helper to assist driver authors in ensuring that shift operations don't cause overflows or other odd conditions. Signed-off-by: Jason Gunthorpe <jgg@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> [kees: tweaked comments and commit log, dropped unneeded assignment] Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2023-05-15 17:23:20 +09:00
Vegard Nossum	63452e45e7	compiler.h: fix error in BUILD_BUG_ON() reporting [ Upstream commit `af9c5d2e3b` ] compiletime_assert() uses __LINE__ to create a unique function name. This means that if you have more than one BUILD_BUG_ON() in the same source line (which can happen if they appear e.g. in a macro), then the error message from the compiler might output the wrong condition. For this source file: #include <linux/build_bug.h> #define macro() \ BUILD_BUG_ON(1); \ BUILD_BUG_ON(0); void foo() { macro(); } gcc would output: ./include/linux/compiler.h:350:38: error: call to `__compiletime_assert_9' declared with attribute error: BUILD_BUG_ON failed: 0 _compiletime_assert(condition, msg, __compiletime_assert_, __LINE__) However, it was not the BUILD_BUG_ON(0) that failed, so it should say 1 instead of 0. With this patch, we use __COUNTER__ instead of __LINE__, so each BUILD_BUG_ON() gets a different function name and the correct condition is printed: ./include/linux/compiler.h:350:38: error: call to `__compiletime_assert_0' declared with attribute error: BUILD_BUG_ON failed: 1 _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__) Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Masahiro Yamada <yamada.masahiro@socionext.com> Reviewed-by: Daniel Santos <daniel.santos@pobox.com> Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk> Cc: Ian Abbott <abbotti@mev.co.uk> Cc: Joe Perches <joe@perches.com> Link: http://lkml.kernel.org/r/20200331112637.25047-1-vegard.nossum@oracle.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2023-05-15 17:16:12 +09:00
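The name collision behind this entry can be reproduced without triggering a build failure. The sketch below uses hypothetical macro names and stringifies the generated identifiers instead of declaring functions with them; `__COUNTER__` is a compiler extension (available in GCC and Clang, among others), not standard C.

```c
#include <string.h>

#define PASTE_(a, b) a##b
#define PASTE(a, b)  PASTE_(a, b)
#define STR_(x)      #x
#define STR(x)       STR_(x)

/* Identifier built from the source line vs. from a running counter. */
#define NAME_BY_LINE()    STR(PASTE(assert_, __LINE__))
#define NAME_BY_COUNTER() STR(PASTE(assert_, __COUNTER__))

/* Both uses share one source line, as they would inside a macro body. */
static const char *by_line[2] = { NAME_BY_LINE(), NAME_BY_LINE() };
static const char *by_counter[2] = { NAME_BY_COUNTER(), NAME_BY_COUNTER() };

/* Returns 1: both __LINE__-based names are identical. */
int line_names_collide(void)
{
    return strcmp(by_line[0], by_line[1]) == 0;
}

/* Returns 0: each __COUNTER__ expansion yields a fresh number. */
int counter_names_collide(void)
{
    return strcmp(by_counter[0], by_counter[1]) == 0;
}
```

Two assertions that expand to the same function name share one error attribute, which is how the compiler ends up reporting the wrong condition.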
Qian Cai	8094105e0d	percpu_counter: fix a data race at vm_committed_as [ Upstream commit `7e23452002` ] "vm_committed_as.count" could be accessed concurrently as reported by KCSAN, BUG: KCSAN: data-race in __vm_enough_memory / percpu_counter_add_batch write to 0xffffffff9451c538 of 8 bytes by task 65879 on cpu 35: percpu_counter_add_batch+0x83/0xd0 percpu_counter_add_batch at lib/percpu_counter.c:91 __vm_enough_memory+0xb9/0x260 dup_mm+0x3a4/0x8f0 copy_process+0x2458/0x3240 _do_fork+0xaa/0x9f0 __do_sys_clone+0x125/0x160 __x64_sys_clone+0x70/0x90 do_syscall_64+0x91/0xb05 entry_SYSCALL_64_after_hwframe+0x49/0xbe read to 0xffffffff9451c538 of 8 bytes by task 66773 on cpu 19: __vm_enough_memory+0x199/0x260 percpu_counter_read_positive at include/linux/percpu_counter.h:81 (inlined by) __vm_enough_memory at mm/util.c:839 mmap_region+0x1b2/0xa10 do_mmap+0x45c/0x700 vm_mmap_pgoff+0xc0/0x130 ksys_mmap_pgoff+0x6e/0x300 __x64_sys_mmap+0x33/0x40 do_syscall_64+0x91/0xb05 entry_SYSCALL_64_after_hwframe+0x49/0xbe The read is outside percpu_counter::lock critical section which results in a data race. Fix it by adding a READ_ONCE() in percpu_counter_read_positive() which could also serve as the existing compiler memory barrier. Signed-off-by: Qian Cai <cai@lca.pw> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Marco Elver <elver@google.com> Link: http://lkml.kernel.org/r/1582302724-2804-1-git-send-email-cai@lca.pw Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2023-05-15 17:16:11 +09:00
Eric W. Biederman	2f9b5a87ef	signal: Extend exec_id to 64bits commit `d1e7fd6462` upstream. Replace the 32bit exec_id with a 64bit exec_id to make it impossible to wrap the exec_id counter. With care an attacker can cause exec_id wrap and send arbitrary signals to a newly exec'd parent. This bypasses the signal sending checks if the parent changes their credentials during exec. The severity of this problem can be seen in that in my limited testing of a 32bit exec_id it can take as little as 19s to exec 65536 times. Which means that it can take as little as 14 days to wrap a 32bit exec_id. Adam Zabrocki has succeeded wrapping the self_exec_id in 7 days. Even my slower timing is in the uptime of a typical server. Which means self_exec_id is simply a speed bump today, and if exec gets noticeably faster self_exec_id won't even be a speed bump. Extending self_exec_id to 64bits introduces a problem on 32bit architectures where reading self_exec_id is no longer atomic and can take two read instructions. Which means that it is possible to hit a window where the read value of exec_id does not match the written value. So with very lucky timing after this change this still remains exploitable. I have updated the update of exec_id on exec to use WRITE_ONCE and the read of exec_id in do_notify_parent to use READ_ONCE to make it clear that there is no locking between these two locations. Link: https://lore.kernel.org/kernel-hardening/20200324215049.GA3710@pi3.com.pl Fixes: 2.3.23pre2 Cc: stable@vger.kernel.org Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2023-05-15 17:14:27 +09:00
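The 14-day figure in the entry above follows from its own measurement of 65536 execs in 19 seconds: a 32-bit counter wraps after 2^32 / 2^16 = 65536 such batches, or 65536 × 19 = 1,245,184 seconds, a little over 14 days. The helper below, with the hypothetical name `seconds_to_wrap`, reproduces that arithmetic and lets the same rate be applied to a 64-bit counter.

```c
#include <stdint.h>

/*
 * Seconds needed to wrap a counter of counter_bits bits when
 * 2^batch_log2 increments take batch_seconds seconds.
 * Assumes batch_log2 <= counter_bits and counter_bits - batch_log2 < 64.
 */
uint64_t seconds_to_wrap(unsigned counter_bits, unsigned batch_log2,
                         uint64_t batch_seconds)
{
    uint64_t batches = UINT64_C(1) << (counter_bits - batch_log2);

    return batches * batch_seconds;
}
```

For example, `seconds_to_wrap(32, 16, 19) / 86400` gives 14 whole days, while the same call with 64 bits gives a figure on the order of tens of billions of days, which is why widening the counter removes the wrap as a practical concern.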
Martin Blumenstingl	b5293f3477	thermal: devfreq_cooling: inline all stubs for CONFIG_DEVFREQ_THERMAL=n commit `3f5b995904` upstream. When CONFIG_DEVFREQ_THERMAL is disabled all functions except of_devfreq_cooling_register_power() were already inlined. Also inline the last function to avoid compile errors when multiple drivers call of_devfreq_cooling_register_power() when CONFIG_DEVFREQ_THERMAL is not set. Compilation failed with the following message: multiple definition of `of_devfreq_cooling_register_power' (which then lists all usages of of_devfreq_cooling_register_power()) Thomas Zimmermann reported this problem [0] on a kernel config with CONFIG_DRM_LIMA={m,y}, CONFIG_DRM_PANFROST={m,y} and CONFIG_DEVFREQ_THERMAL=n after both, the lima and panfrost drivers gained devfreq cooling support. [0] https://www.spinics.net/lists/dri-devel/msg252825.html Fixes: `a76caf55e5` ("thermal: Add devfreq cooling") Cc: stable@vger.kernel.org Reported-by: Thomas Zimmermann <tzimmermann@suse.de> Signed-off-by: Martin Blumenstingl <martin.blumenstingl@googlemail.com> Tested-by: Thomas Zimmermann <tzimmermann@suse.de> Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org> Link: https://lore.kernel.org/r/20200403205133.1101808-1-martin.blumenstingl@googlemail.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2023-05-15 17:14:20 +09:00
Peter Zijlstra	e8aeae5a10	locking/atomic, kref: Add kref_read() commit `2c935bc572` upstream. Since we need to change the implementation, stop exposing internals. Provide kref_read() to read the current reference count; typically used for debug messages. Kills two anti-patterns: atomic_read(&kref->refcount) kref->refcount.counter Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org> [only add kref_read() to kref.h for stable backports - gregkh] Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2023-05-15 17:12:36 +09:00
Jiri Slaby	c6795bdab4	vt: switch vt_dont_switch to bool commit `f400991bf8` upstream. vt_dont_switch is pure boolean, no need for whole char. Signed-off-by: Jiri Slaby <jslaby@suse.cz> Link: https://lore.kernel.org/r/20200219073951.16151-6-jslaby@suse.cz Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2023-05-15 17:12:32 +09:00
Jiri Slaby	80f2af4913	vt: selection, introduce vc_is_sel commit `dce05aa6ee` upstream. Avoid global variables (namely sel_cons) by introducing vc_is_sel. It checks whether the parameter is the current selection console. This will help putting sel_cons to a struct later. Signed-off-by: Jiri Slaby <jslaby@suse.cz> Link: https://lore.kernel.org/r/20200219073951.16151-1-jslaby@suse.cz Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2023-05-15 17:12:28 +09:00
Peter Zijlstra	935c79177b	futex: Fix inode life-time issue commit `8019ad13ef` upstream. As reported by Jann, ihold() does not in fact guarantee inode persistence. And instead of making it so, replace the usage of inode pointers with a per boot, machine wide, unique inode identifier. This sequence number is global, but shared (file backed) futexes are rare enough that this should not become a performance issue. Reported-by: Jann Horn <jannh@google.com> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2023-05-15 17:11:10 +09:00
Joerg Roedel	70b0b5b440	x86/mm: split vmalloc_sync_all() commit `763802b53a` upstream. Commit `3f8fd02b1b` ("mm/vmalloc: Sync unmappings in __purge_vmap_area_lazy()") introduced a call to vmalloc_sync_all() in the vunmap() code-path. While this change was necessary to maintain correctness on x86-32-pae kernels, it also adds additional cycles for architectures that don't need it. Specifically on x86-64 with CONFIG_VMAP_STACK=y some people reported severe performance regressions in micro-benchmarks because it now also calls the x86-64 implementation of vmalloc_sync_all() on vunmap(). But the vmalloc_sync_all() implementation on x86-64 is only needed for newly created mappings. To avoid the unnecessary work on x86-64 and to gain the performance back, split up vmalloc_sync_all() into two functions: * vmalloc_sync_mappings(), and * vmalloc_sync_unmappings() Most call-sites to vmalloc_sync_all() only care about new mappings being synchronized. The only exception is the new call-site added in the above mentioned commit. Shile Zhang directed us to a report of an 80% regression in reaim throughput. Fixes: `3f8fd02b1b` ("mm/vmalloc: Sync unmappings in __purge_vmap_area_lazy()") Reported-by: kernel test robot <oliver.sang@intel.com> Reported-by: Shile Zhang <shile.zhang@linux.alibaba.com> Signed-off-by: Joerg Roedel <jroedel@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Tested-by: Borislav Petkov <bp@suse.de> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> [GHES] Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: <stable@vger.kernel.org> Link: http://lkml.kernel.org/r/20191009124418.8286-1-joro@8bytes.org Link: https://lists.01.org/hyperkitty/list/lkp@lists.01.org/thread/4D3JPPHBNOSPFK2KEPC6KGKS6J25AIDB/ Link: http://lkml.kernel.org/r/20191113095530.228959-1-shile.zhang@linux.alibaba.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2023-05-15 17:11:06 +09:00
Heiner Kallweit	84f1052ab2	net: phy: fix MDIO bus PM PHY resuming [ Upstream commit `611d779af7` ] So far we have the unfortunate situation that mdio_bus_phy_may_suspend() is called in suspend AND resume path, assuming that function result is the same. After the original change this is no longer the case, resulting in broken resume as reported by Geert. To fix this call mdio_bus_phy_may_suspend() in the suspend path only, and let the phy_device store the info whether it was suspended by MDIO bus PM. Fixes: `503ba7c696` ("net: phy: Avoid multiple suspends") Reported-by: Geert Uytterhoeven <geert@linux-m68k.org> Tested-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2023-05-15 17:09:21 +09:00

53926 Commits