[ Upstream commit bba1b2a890 ]
Previously, when this sample is added, commit 1c47910ef8
("samples/bpf: add perf_event+bpf example"), a symbol 'sys_read' and
'sys_write' has been used without no prefixes. But currently there are
no exact symbols with these under kallsyms and this leads to failure.
This commit changes exact compare to substring compare to keep compatible
with exact symbol or prefixed symbol.
Fixes: 1c47910ef8 ("samples/bpf: add perf_event+bpf example")
Signed-off-by: Daniel T. Lee <danieltimlee@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20191205080114.19766-2-danieltimlee@gmail.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit c6b16761c6 ]
The LCD panel on AM4 GP EVMs and ePOS boards seems to be
osd070t1718-19ts. The current dts files say osd057T0559-34ts. Possibly
the panel has changed since the early EVMs, or there has been a mistake
with the panel type.
Update the DT files accordingly.
Acked-by: Laurent Pinchart <laurent.pinchart@ideasonboard.com>
Signed-off-by: Tomi Valkeinen <tomi.valkeinen@ti.com>
Signed-off-by: Tony Lindgren <tony@atomide.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit c52c91bb9a ]
When switching ChipSelect from default CS0 to any other CS, driver fails
to update the bits in system control module register that control which
CS is mapped for MMIO access. This causes reads to fail when driver
tries to access QSPI flash on CS1/2/3.
Fix this by updating appropriate bits whenever active CS changes.
Reported-by: Andreas Dannenberg <dannenberg@ti.com>
Signed-off-by: Vignesh Raghavendra <vigneshr@ti.com>
Link: https://lore.kernel.org/r/20191211155216.30212-1-vigneshr@ti.com
Signed-off-by: Mark Brown <broonie@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit c74751f4c3 ]
If any change happened in the configuration of VF in VM while
collecting live dump, there could be a race and firmware can return
more data than allocated dump length. Fix it by keeping track of
the accumulated core dump length copied so far and abort the copy
with error code if the next chunk of core dump will exceed the
original dump length.
Fixes: 6c5657d085 ("bnxt_en: Add support for ethtool get dump.")
Signed-off-by: Vasundhara Volam <vasundhara-v.volam@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 30e647a764 ]
During definition of the CPU thermal zone of BCM283x SoC family there
was a misunderstanding of the meaning "criticial trip point" and the
thermal throttling range of the VideoCore firmware. The latter one takes
effect when the core temperature is at least 85 degree celsius or higher
So the current critical trip point doesn't make sense, because the
thermal shutdown appears before the firmware has a chance to throttle
the ARM core(s).
Fix these unwanted shutdowns by increasing the critical trip point
to a value which shouldn't be reached with working thermal throttling.
Fixes: 0fe4d2181c ("ARM: dts: bcm283x: Add CPU thermal zone with 1 trip point")
Signed-off-by: Stefan Wahren <wahrenst@gmx.net>
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 5cc6c8d4a9 ]
Fix multiple kprobe event testcase to work it correctly.
There are 2 bugfixes.
- Since `wc -l FILE` returns not only line number but also
FILE filename, following "if" statement always failed.
Fix this bug by replacing it with 'cat FILE | wc -l'
- Since "while do-done loop" block with pipeline becomes a
subshell, $N local variable is not update outside of
the loop.
Fix this bug by using actual target number (256) instead
of $N.
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 0d2c96af79 ]
Userspace might bogusly sent NFT_DATA_VERDICT in several netlink
attributes that assume NFT_DATA_VALUE. Moreover, make sure that error
path invokes nft_data_release() to decrement the reference count on the
chain object.
Fixes: 96518518cc ("netfilter: add nftables")
Fixes: 0f3cd9b369 ("netfilter: nf_tables: add range expression")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit bffc124b6f ]
Only NFTA_SET_ELEM_KEY and NFTA_SET_ELEM_FLAGS make sense for elements
whose NFT_SET_ELEM_INTERVAL_END flag is set on.
Fixes: 96518518cc ("netfilter: add nftables")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit db3b665dd7 ]
The existing rbtree implementation might store consecutive elements
where the closing element and the opening element might overlap, eg.
[ a, a+1) [ a+1, a+2)
This patch removes the optimization for non-anonymous sets in the exact
matching case, where it is assumed to stop searching in case that the
closing element is found. Instead, invalidate candidate interval and
keep looking further in the tree.
The lookup/get operation might return false, while there is an element
in the rbtree. Moreover, the get operation returns true as if a+2 would
be in the tree. This happens with named sets after several set updates.
The existing lookup optimization (that only works for the anonymous
sets) might not reach the opening [ a+1,... element if the closing
...,a+1) is found in first place when walking over the rbtree. Hence,
walking the full tree in that case is needed.
This patch fixes the lookup and get operations.
Fixes: e701001e7c ("netfilter: nft_rbtree: allow adjacent intervals with dynamic updates")
Fixes: ba0e4d9917 ("netfilter: nf_tables: get set elements via netlink")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 164166558a ]
With 'bytes(__u32)' being 32, a left-shift of 31 may happen which is
undefined for the signed 32-bit value 1. Avoid this by declaring 1 as
unsigned.
Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 2a76352ad2 ]
Currently we add individual copy of same OPP table for each CPU within
the cluster. This is redundant and doesn't reflect the reality.
We can't use core cpumask to set policy->cpus in ve_spc_cpufreq_init()
anymore as it gets called via cpuhp_cpufreq_online()->cpufreq_online()
->cpufreq_driver->init() and the cpumask gets updated upon CPU hotplug
operations. It also may cause issues when the vexpress_spc_cpufreq
driver is built as a module.
Since ve_spc_clk_init is built-in device initcall, we should be able to
use the same topology_core_cpumask to set the opp sharing cpumask via
dev_pm_opp_set_sharing_cpus and use the same later in the driver via
dev_pm_opp_get_sharing_cpus.
Cc: Liviu Dudau <liviu.dudau@arm.com>
Cc: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Sudeep Holla <sudeep.holla@arm.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 0aeb1f2b74 ]
Without this "jedec,spi-nor" compatible property, probing of the SPI NOR
does not work on the NXP i.MX6ULL EVK. Fix this by adding this
compatible property to the DT.
Fixes: 7d77b8505a ("ARM: dts: imx6ull: fix the imx6ull-14x14-evk configuration")
Signed-off-by: Stefan Roese <sr@denx.de>
Reviewed-by: Fabio Estevam <festevam@gmail.com>
Reviewed-by: Frieder Schrempf <frieder.schrempf@kontron.de>
Signed-off-by: Shawn Guo <shawnguo@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit af16489848 ]
Michael Weiser reported that he got this error during a kexec rebooting:
esrt: Unsupported ESRT version 2904149718861218184.
The ESRT memory stays in EFI boot services data, and it was reserved
in kernel via efi_mem_reserve(). The initial purpose of the reservation
is to reuse the EFI boot services data across kexec reboot. For example
the BGRT image data and some ESRT memory like Michael reported.
But although the memory is reserved it is not updated in the X86 E820 table,
and kexec_file_load() iterates system RAM in the IO resource list to find places
for kernel, initramfs and other stuff. In Michael's case the kexec loaded
initramfs overwrote the ESRT memory and then the failure happened.
Since kexec_file_load() depends on the E820 table being updated, just fix this
by updating the reserved EFI boot services memory as reserved type in E820.
Originally any memory descriptors with EFI_MEMORY_RUNTIME attribute are
bypassed in the reservation code path because they are assumed as reserved.
But the reservation is still needed for multiple kexec reboots,
and it is the only possible case we come here thus just drop the code
chunk, then everything works without side effects.
On my machine the ESRT memory sits in an EFI runtime data range, it does
not trigger the problem, but I successfully tested with BGRT instead.
both kexec_load() and kexec_file_load() work and kdump works as well.
[ mingo: Edited the changelog. ]
Reported-by: Michael Weiser <michael@weiser.dinsnail.net>
Tested-by: Michael Weiser <michael@weiser.dinsnail.net>
Signed-off-by: Dave Young <dyoung@redhat.com>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: kexec@lists.infradead.org
Cc: linux-efi@vger.kernel.org
Link: https://lkml.kernel.org/r/20191204075233.GA10520@dhcp-128-65.nay.redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 587db8ebda ]
When we use 'O=' with make to build libtraceevent in a separate folder
it fails to install libtraceevent.a and libtraceevent.so.1.1.0 with the
error:
INSTALL /home/sudip/linux/obj-trace/libtraceevent.a
INSTALL /home/sudip/linux/obj-trace/libtraceevent.so.1.1.0
cp: cannot stat 'libtraceevent.a': No such file or directory
Makefile:225: recipe for target 'install_lib' failed
make: *** [install_lib] Error 1
I used the command:
make O=../../../obj-trace DESTDIR=~/test prefix==/usr install
It turns out libtraceevent Makefile, even though it builds in a separate
folder, searches for libtraceevent.a and libtraceevent.so.1.1.0 in its
source folder.
So, add the 'OUTPUT' prefix to the source path so that 'make' looks for
the files in the correct place.
Signed-off-by: Sudipm Mukherjee <sudipm.mukherjee@gmail.com>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: linux-trace-devel@vger.kernel.org
Link: http://lore.kernel.org/lkml/20191115113610.21493-1-sudipm.mukherjee@gmail.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 1e58252e33 ]
mwifiex_process_tdls_action_frame() without checking
the incoming tdls infomation element's vality before use it,
this may cause multi heap buffer overflows.
Fix them by putting vality check before use it.
IE is TLV struct, but ht_cap and ht_oper aren’t TLV struct.
the origin marvell driver code is wrong:
memcpy(&sta_ptr->tdls_cap.ht_oper, pos,....
memcpy((u8 *)&sta_ptr->tdls_cap.ht_capb, pos,...
Fix the bug by changing pos(the address of IE) to
pos+2 ( the address of IE value ).
Signed-off-by: qize wang <wangqize888888888@gmail.com>
Signed-off-by: Kalle Valo <kvalo@codeaurora.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 18a110b022 ]
Curtis Taylor and Jon Maxwell reported and debugged a crash on 3.10
based kernel.
Crash occurs in ctnetlink_conntrack_events because net->nfnl socket is
NULL. The nfnl socket was set to NULL by netns destruction running on
another cpu.
The exiting network namespace calls the relevant destructors in the
following order:
1. ctnetlink_net_exit_batch
This nulls out the event callback pointer in struct netns.
2. nfnetlink_net_exit_batch
This nulls net->nfnl socket and frees it.
3. nf_conntrack_cleanup_net_list
This removes all remaining conntrack entries.
This is order is correct. The only explanation for the crash so ar is:
cpu1: conntrack is dying, eviction occurs:
-> nf_ct_delete()
-> nf_conntrack_event_report \
-> nf_conntrack_eventmask_report
-> notify->fcn() (== ctnetlink_conntrack_events).
cpu1: a. fetches rcu protected pointer to obtain ctnetlink event callback.
b. gets interrupted.
cpu2: runs netns exit handlers:
a runs ctnetlink destructor, event cb pointer set to NULL.
b runs nfnetlink destructor, nfnl socket is closed and set to NULL.
cpu1: c. resumes and trips over NULL net->nfnl.
Problem appears to be that ctnetlink_net_exit_batch only prevents future
callers of nf_conntrack_eventmask_report() from obtaining the callback.
It doesn't wait of other cpus that might have already obtained the
callbacks address.
I don't see anything in upstream kernels that would prevent similar
crash: We need to wait for all cpus to have exited the event callback.
Fixes: 9592a5c01e ("netfilter: ctnetlink: netns support")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 1a365e8223 ]
This fixes various data races in spinlock_debug. By testing with KCSAN,
it is observable that the console gets spammed with data races reports,
suggesting these are extremely frequent.
Example data race report:
read to 0xffff8ab24f403c48 of 4 bytes by task 221 on cpu 2:
debug_spin_lock_before kernel/locking/spinlock_debug.c:85 [inline]
do_raw_spin_lock+0x9b/0x210 kernel/locking/spinlock_debug.c:112
__raw_spin_lock include/linux/spinlock_api_smp.h:143 [inline]
_raw_spin_lock+0x39/0x40 kernel/locking/spinlock.c:151
spin_lock include/linux/spinlock.h:338 [inline]
get_partial_node.isra.0.part.0+0x32/0x2f0 mm/slub.c:1873
get_partial_node mm/slub.c:1870 [inline]
<snip>
write to 0xffff8ab24f403c48 of 4 bytes by task 167 on cpu 3:
debug_spin_unlock kernel/locking/spinlock_debug.c:103 [inline]
do_raw_spin_unlock+0xc9/0x1a0 kernel/locking/spinlock_debug.c:138
__raw_spin_unlock_irqrestore include/linux/spinlock_api_smp.h:159 [inline]
_raw_spin_unlock_irqrestore+0x2d/0x50 kernel/locking/spinlock.c:191
spin_unlock_irqrestore include/linux/spinlock.h:393 [inline]
free_debug_processing+0x1b3/0x210 mm/slub.c:1214
__slab_free+0x292/0x400 mm/slub.c:2864
<snip>
As a side-effect, with KCSAN, this eventually locks up the console, most
likely due to deadlock, e.g. .. -> printk lock -> spinlock_debug ->
KCSAN detects data race -> kcsan_print_report() -> printk lock ->
deadlock.
This fix will 1) avoid the data races, and 2) allow using lock debugging
together with KCSAN.
Reported-by: Qian Cai <cai@lca.pw>
Signed-off-by: Marco Elver <elver@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Link: https://lkml.kernel.org/r/20191120155715.28089-1-elver@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 45dfbf5697 ]
max98090_interrupt() and max98090_pll_work() run in 2 different threads.
There are 2 possible races:
Note: M98090_REG_DEVICE_STATUS = 0x01.
Note: ULK == 0, PLL is locked; ULK == 1, PLL is unlocked.
max98090_interrupt max98090_pll_work
----------------------------------------------
schedule max98090_pll_work
restart max98090 codec
receive ULK INT
assert ULK == 0
schedule max98090_pll_work (1).
In the case (1), the PLL is locked but max98090_interrupt unnecessarily
schedules another max98090_pll_work.
max98090_interrupt max98090_pll_work max98090 codec
----------------------------------------------------------------------
ULK = 1
receive ULK INT
read 0x01
ULK = 0 (clear on read)
schedule max98090_pll_work
restart max98090 codec
ULK = 1
receive ULK INT
read 0x01
ULK = 0 (clear on read)
read 0x01
assert ULK == 0 (2).
In the case (2), both max98090_interrupt and max98090_pll_work read
the same clear-on-read register. max98090_pll_work would falsely
thought PLL is locked.
Note: the case (2) race is introduced by the previous commit ("ASoC:
max98090: exit workaround earlier if PLL is locked") to check the status
and exit the loop earlier in max98090_pll_work.
There are 2 possible solution options:
A. turn off ULK interrupt before scheduling max98090_pll_work; and turn
on again before exiting max98090_pll_work.
B. remove the second thread of execution.
Option A cannot fix the case (2) race because it still has 2 threads
access the same clear-on-read register simultaneously. Although we
could suppose the register is volatile and read the status via I2C could
be much slower than the hardware raises the bits.
Option B introduces a maximum 10~12 msec penalty delay in the interrupt
handler. However, it could only punish the jack detection by extra
10~12 msec.
Adopts option B which is the better solution overall.
Signed-off-by: Tzung-Bi Shih <tzungbi@google.com>
Link: https://lore.kernel.org/r/20191122073114.219945-4-tzungbi@google.com
Reviewed-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
Signed-off-by: Mark Brown <broonie@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit 8442b02bf3 upstream.
When fuzzing the USB subsystem with syzkaller, we currently use 8 testing
processes within one VM. To isolate testing processes from one another it
is desirable to assign a dedicated USB bus to each of those, which means
we need at least 8 Dummy UDC/HCD devices.
This patch increases the maximum number of Dummy UDC/HCD devices to 32
(more than 8 in case we need more of them in the future).
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Link: https://lore.kernel.org/r/665578f904484069bb6100fb20283b22a046ad9b.1571667489.git.andreyknvl@google.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit ff61541cc6 ]
Commit
8062382c8d ("perf/x86/intel/bts: Add BTS PMU driver")
brought in a warning with the BTS buffer initialization
that is easily tripped with (assuming KPTI is disabled):
instantly throwing:
> ------------[ cut here ]------------
> WARNING: CPU: 2 PID: 326 at arch/x86/events/intel/bts.c:86 bts_buffer_setup_aux+0x117/0x3d0
> Modules linked in:
> CPU: 2 PID: 326 Comm: perf Not tainted 5.4.0-rc8-00291-gceb9e77324fa #904
> RIP: 0010:bts_buffer_setup_aux+0x117/0x3d0
> Call Trace:
> rb_alloc_aux+0x339/0x550
> perf_mmap+0x607/0xc70
> mmap_region+0x76b/0xbd0
...
It appears to assume (for lost raisins) that PagePrivate() is set,
while later it actually tests for PagePrivate() before using
page_private().
Make it consistent and always check PagePrivate() before using
page_private().
Fixes: 8062382c8d ("perf/x86/intel/bts: Add BTS PMU driver")
Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Link: https://lkml.kernel.org/r/20191205142853.28894-2-alexander.shishkin@linux.intel.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit f9bd84a8a8 ]
For each I/O request, blkback first maps the foreign pages for the
request to its local pages. If an allocation of a local page for the
mapping fails, it should unmap every mapping already made for the
request.
However, blkback's handling mechanism for the allocation failure does
not mark the remaining foreign pages as unmapped. Therefore, the unmap
function merely tries to unmap every valid grant page for the request,
including the pages not mapped due to the allocation failure. On a
system that fails the allocation frequently, this problem leads to
following kernel crash.
[ 372.012538] BUG: unable to handle kernel NULL pointer dereference at 0000000000000001
[ 372.012546] IP: [<ffffffff814071ac>] gnttab_unmap_refs.part.7+0x1c/0x40
[ 372.012557] PGD 16f3e9067 PUD 16426e067 PMD 0
[ 372.012562] Oops: 0002 [#1] SMP
[ 372.012566] Modules linked in: act_police sch_ingress cls_u32
...
[ 372.012746] Call Trace:
[ 372.012752] [<ffffffff81407204>] gnttab_unmap_refs+0x34/0x40
[ 372.012759] [<ffffffffa0335ae3>] xen_blkbk_unmap+0x83/0x150 [xen_blkback]
...
[ 372.012802] [<ffffffffa0336c50>] dispatch_rw_block_io+0x970/0x980 [xen_blkback]
...
Decompressing Linux... Parsing ELF... done.
Booting the kernel.
[ 0.000000] Initializing cgroup subsys cpuset
This commit fixes this problem by marking the grant pages of the given
request that didn't mapped due to the allocation failure as invalid.
Fixes: c6cc142dac ("xen-blkback: use balloon pages for all mappings")
Reviewed-by: David Woodhouse <dwmw@amazon.de>
Reviewed-by: Maximilian Heyne <mheyne@amazon.de>
Reviewed-by: Paul Durrant <pdurrant@amazon.co.uk>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Signed-off-by: SeongJae Park <sjpark@amazon.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 72a81ad9d6 ]
If an SMT capable system is not IPL'ed from the first CPU the setup of
the physical to logical CPU mapping is broken: the IPL core gets CPU
number 0, but then the next core gets CPU number 1. Correct would be
that all SMT threads of CPU 0 get the subsequent logical CPU numbers.
This is important since a lot of code (like e.g. the CPU topology
code) assumes that CPU maps are setup like this. If the mapping is
broken the system will not IPL due to broken topology masks:
[ 1.716341] BUG: arch topology broken
[ 1.716342] the SMT domain not a subset of the MC domain
[ 1.716343] BUG: arch topology broken
[ 1.716344] the MC domain not a subset of the BOOK domain
This scenario can usually not happen since LPARs are always IPL'ed
from CPU 0 and also re-IPL is intiated from CPU 0. However older
kernels did initiate re-IPL on an arbitrary CPU. If therefore a re-IPL
from an old kernel into a new kernel is initiated this may lead to
crash.
Fix this by setting up the physical to logical CPU mapping correctly.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 6abf572621 ]
Running stress-test test_2 in mtd-utils on ubi device, sometimes we can
get following oops message:
BUG: unable to handle page fault for address: ffffffff00000140
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 280a067 P4D 280a067 PUD 0
Oops: 0000 [#1] SMP
CPU: 0 PID: 60 Comm: kworker/u16:1 Kdump: loaded Not tainted 5.2.0 #13
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0
-0-ga698c8995f-prebuilt.qemu.org 04/01/2014
Workqueue: writeback wb_workfn (flush-ubifs_0_0)
RIP: 0010:rb_next_postorder+0x2e/0xb0
Code: 80 db 03 01 48 85 ff 0f 84 97 00 00 00 48 8b 17 48 83 05 bc 80 db
03 01 48 83 e2 fc 0f 84 82 00 00 00 48 83 05 b2 80 db 03 01 <48> 3b 7a
10 48 89 d0 74 02 f3 c3 48 8b 52 08 48 83 05 a3 80 db 03
RSP: 0018:ffffc90000887758 EFLAGS: 00010202
RAX: ffff888129ae4700 RBX: ffff888138b08400 RCX: 0000000080800001
RDX: ffffffff00000130 RSI: 0000000080800024 RDI: ffff888138b08400
RBP: ffff888138b08400 R08: ffffea0004a6b920 R09: 0000000000000000
R10: ffffc90000887740 R11: 0000000000000001 R12: ffff888128d48000
R13: 0000000000000800 R14: 000000000000011e R15: 00000000000007c8
FS: 0000000000000000(0000) GS:ffff88813ba00000(0000)
knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffffffff00000140 CR3: 000000013789d000 CR4: 00000000000006f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
destroy_old_idx+0x5d/0xa0 [ubifs]
ubifs_tnc_start_commit+0x4fe/0x1380 [ubifs]
do_commit+0x3eb/0x830 [ubifs]
ubifs_run_commit+0xdc/0x1c0 [ubifs]
Above Oops are due to the slab-out-of-bounds happened in do-while of
function layout_in_gaps indirectly called by ubifs_tnc_start_commit. In
function layout_in_gaps, there is a do-while loop placing index nodes
into the gaps created by obsolete index nodes in non-empty index LEBs
until rest index nodes can totally be placed into pre-allocated empty
LEBs. @c->gap_lebs points to a memory area(integer array) which records
LEB numbers used by 'in-the-gaps' method. Whenever a fitable index LEB
is found, corresponding lnum will be incrementally written into the
memory area pointed by @c->gap_lebs. The size
((@c->lst.idx_lebs + 1) * sizeof(int)) of memory area is allocated before
do-while loop and can not be changed in the loop. But @c->lst.idx_lebs
could be increased by function ubifs_change_lp (called by
layout_leb_in_gaps->ubifs_find_dirty_idx_leb->get_idx_gc_leb) during the
loop. So, sometimes oob happens when number of cycles in do-while loop
exceeds the original value of @c->lst.idx_lebs. See detail in
https://bugzilla.kernel.org/show_bug.cgi?id=204229.
This patch fixes oob in layout_in_gaps.
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 5d1116d4c6 ]
Christoph Hellwig complained about the following soft lockup warning
when running scrub after generic/175 when preemption is disabled and
slub debugging is enabled:
watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [xfs_scrub:161]
Modules linked in:
irq event stamp: 41692326
hardirqs last enabled at (41692325): [<ffffffff8232c3b7>] _raw_0
hardirqs last disabled at (41692326): [<ffffffff81001c5a>] trace0
softirqs last enabled at (41684994): [<ffffffff8260031f>] __do_e
softirqs last disabled at (41684987): [<ffffffff81127d8c>] irq_e0
CPU: 3 PID: 16189 Comm: xfs_scrub Not tainted 5.4.0-rc3+ #30
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.124
RIP: 0010:_raw_spin_unlock_irqrestore+0x39/0x40
Code: 89 f3 be 01 00 00 00 e8 d5 3a e5 fe 48 89 ef e8 ed 87 e5 f2
RSP: 0018:ffffc9000233f970 EFLAGS: 00000286 ORIG_RAX: ffffffffff3
RAX: ffff88813b398040 RBX: 0000000000000286 RCX: 0000000000000006
RDX: 0000000000000006 RSI: ffff88813b3988c0 RDI: ffff88813b398040
RBP: ffff888137958640 R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ffffea00042b0c00
R13: 0000000000000001 R14: ffff88810ac32308 R15: ffff8881376fc040
FS: 00007f6113dea700(0000) GS:ffff88813bb80000(0000) knlGS:00000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f6113de8ff8 CR3: 000000012f290000 CR4: 00000000000006e0
Call Trace:
free_debug_processing+0x1dd/0x240
__slab_free+0x231/0x410
kmem_cache_free+0x30e/0x360
xchk_ag_btcur_free+0x76/0xb0
xchk_ag_free+0x10/0x80
xchk_bmap_iextent_xref.isra.14+0xd9/0x120
xchk_bmap_iextent+0x187/0x210
xchk_bmap+0x2e0/0x3b0
xfs_scrub_metadata+0x2e7/0x500
xfs_ioc_scrub_metadata+0x4a/0xa0
xfs_file_ioctl+0x58a/0xcd0
do_vfs_ioctl+0xa0/0x6f0
ksys_ioctl+0x5b/0x90
__x64_sys_ioctl+0x11/0x20
do_syscall_64+0x4b/0x1a0
entry_SYSCALL_64_after_hwframe+0x49/0xbe
If preemption is disabled, all metadata buffers needed to perform the
scrub are already in memory, and there are a lot of records to check,
it's possible that the scrub thread will run for an extended period of
time without sleeping for IO or any other reason. Then the watchdog
timer or the RCU stall timeout can trigger, producing the backtrace
above.
To fix this problem, call cond_resched() from the scrub thread so that
we back out to the scheduler whenever necessary.
Reported-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 5343da4c17 ]
Current code doesn't limit the number of nested devices.
Nested devices would be handled recursively and this needs huge stack
memory. So, unlimited nested devices could make stack overflow.
This patch adds upper_level and lower_level, they are common variables
and represent maximum lower/upper depth.
When upper/lower device is attached or dettached,
{lower/upper}_level are updated. and if maximum depth is bigger than 8,
attach routine fails and returns -EMLINK.
In addition, this patch converts recursive routine of
netdev_walk_all_{lower/upper} to iterator routine.
Test commands:
ip link add dummy0 type dummy
ip link add link dummy0 name vlan1 type vlan id 1
ip link set vlan1 up
for i in {2..55}
do
let A=$i-1
ip link add vlan$i link vlan$A type vlan id $i
done
ip link del dummy0
Splat looks like:
[ 155.513226][ T908] BUG: KASAN: use-after-free in __unwind_start+0x71/0x850
[ 155.514162][ T908] Write of size 88 at addr ffff8880608a6cc0 by task ip/908
[ 155.515048][ T908]
[ 155.515333][ T908] CPU: 0 PID: 908 Comm: ip Not tainted 5.4.0-rc3+ #96
[ 155.516147][ T908] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
[ 155.517233][ T908] Call Trace:
[ 155.517627][ T908]
[ 155.517918][ T908] Allocated by task 0:
[ 155.518412][ T908] (stack is not available)
[ 155.518955][ T908]
[ 155.519228][ T908] Freed by task 0:
[ 155.519885][ T908] (stack is not available)
[ 155.520452][ T908]
[ 155.520729][ T908] The buggy address belongs to the object at ffff8880608a6ac0
[ 155.520729][ T908] which belongs to the cache names_cache of size 4096
[ 155.522387][ T908] The buggy address is located 512 bytes inside of
[ 155.522387][ T908] 4096-byte region [ffff8880608a6ac0, ffff8880608a7ac0)
[ 155.523920][ T908] The buggy address belongs to the page:
[ 155.524552][ T908] page:ffffea0001822800 refcount:1 mapcount:0 mapping:ffff88806c657cc0 index:0x0 compound_mapcount:0
[ 155.525836][ T908] flags: 0x100000000010200(slab|head)
[ 155.526445][ T908] raw: 0100000000010200 ffffea0001813808 ffffea0001a26c08 ffff88806c657cc0
[ 155.527424][ T908] raw: 0000000000000000 0000000000070007 00000001ffffffff 0000000000000000
[ 155.528429][ T908] page dumped because: kasan: bad access detected
[ 155.529158][ T908]
[ 155.529410][ T908] Memory state around the buggy address:
[ 155.530060][ T908] ffff8880608a6b80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 155.530971][ T908] ffff8880608a6c00: fb fb fb fb fb f1 f1 f1 f1 00 f2 f2 f2 f3 f3 f3
[ 155.531889][ T908] >ffff8880608a6c80: f3 fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 155.532806][ T908] ^
[ 155.533509][ T908] ffff8880608a6d00: fb fb fb fb fb fb fb fb fb f1 f1 f1 f1 00 00 00
[ 155.534436][ T908] ffff8880608a6d80: f2 f3 f3 f3 f3 fb fb fb 00 00 00 00 00 00 00 00
[ ... ]
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit dba7d9b8c7 ]
There are few places where we fetch tp->rcv_nxt while
this field can change from IRQ or other cpu.
We need to add READ_ONCE() annotations, and also make
sure write sides use corresponding WRITE_ONCE() to avoid
store-tearing.
Note that tcp_inq_hint() was already using READ_ONCE(tp->rcv_nxt)
syzbot reported :
BUG: KCSAN: data-race in tcp_poll / tcp_queue_rcv
write to 0xffff888120425770 of 4 bytes by interrupt on cpu 0:
tcp_rcv_nxt_update net/ipv4/tcp_input.c:3365 [inline]
tcp_queue_rcv+0x180/0x380 net/ipv4/tcp_input.c:4638
tcp_rcv_established+0xbf1/0xf50 net/ipv4/tcp_input.c:5616
tcp_v4_do_rcv+0x381/0x4e0 net/ipv4/tcp_ipv4.c:1542
tcp_v4_rcv+0x1a03/0x1bf0 net/ipv4/tcp_ipv4.c:1923
ip_protocol_deliver_rcu+0x51/0x470 net/ipv4/ip_input.c:204
ip_local_deliver_finish+0x110/0x140 net/ipv4/ip_input.c:231
NF_HOOK include/linux/netfilter.h:305 [inline]
NF_HOOK include/linux/netfilter.h:299 [inline]
ip_local_deliver+0x133/0x210 net/ipv4/ip_input.c:252
dst_input include/net/dst.h:442 [inline]
ip_rcv_finish+0x121/0x160 net/ipv4/ip_input.c:413
NF_HOOK include/linux/netfilter.h:305 [inline]
NF_HOOK include/linux/netfilter.h:299 [inline]
ip_rcv+0x18f/0x1a0 net/ipv4/ip_input.c:523
__netif_receive_skb_one_core+0xa7/0xe0 net/core/dev.c:5004
__netif_receive_skb+0x37/0xf0 net/core/dev.c:5118
netif_receive_skb_internal+0x59/0x190 net/core/dev.c:5208
napi_skb_finish net/core/dev.c:5671 [inline]
napi_gro_receive+0x28f/0x330 net/core/dev.c:5704
receive_buf+0x284/0x30b0 drivers/net/virtio_net.c:1061
read to 0xffff888120425770 of 4 bytes by task 7254 on cpu 1:
tcp_stream_is_readable net/ipv4/tcp.c:480 [inline]
tcp_poll+0x204/0x6b0 net/ipv4/tcp.c:554
sock_poll+0xed/0x250 net/socket.c:1256
vfs_poll include/linux/poll.h:90 [inline]
ep_item_poll.isra.0+0x90/0x190 fs/eventpoll.c:892
ep_send_events_proc+0x113/0x5c0 fs/eventpoll.c:1749
ep_scan_ready_list.constprop.0+0x189/0x500 fs/eventpoll.c:704
ep_send_events fs/eventpoll.c:1793 [inline]
ep_poll+0xe3/0x900 fs/eventpoll.c:1930
do_epoll_wait+0x162/0x180 fs/eventpoll.c:2294
__do_sys_epoll_pwait fs/eventpoll.c:2325 [inline]
__se_sys_epoll_pwait fs/eventpoll.c:2311 [inline]
__x64_sys_epoll_pwait+0xcd/0x170 fs/eventpoll.c:2311
do_syscall_64+0xcf/0x2f0 arch/x86/entry/common.c:296
entry_SYSCALL_64_after_hwframe+0x44/0xa9
Reported by Kernel Concurrency Sanitizer on:
CPU: 1 PID: 7254 Comm: syz-fuzzer Not tainted 5.3.0+ #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit f0308fb070 ]
If an ICMP packet comes in on the UDP socket backing an AF_RXRPC socket as
the UDP socket is being shut down, rxrpc_error_report() may get called to
deal with it after sk_user_data on the UDP socket has been cleared, leading
to a NULL pointer access when this local endpoint record gets accessed.
Fix this by just returning immediately if sk_user_data was NULL.
The oops looks like the following:
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
...
RIP: 0010:rxrpc_error_report+0x1bd/0x6a9
...
Call Trace:
? sock_queue_err_skb+0xbd/0xde
? __udp4_lib_err+0x313/0x34d
__udp4_lib_err+0x313/0x34d
icmp_unreach+0x1ee/0x207
icmp_rcv+0x25b/0x28f
ip_protocol_deliver_rcu+0x95/0x10e
ip_local_deliver+0xe9/0x148
__netif_receive_skb_one_core+0x52/0x6e
process_backlog+0xdc/0x177
net_rx_action+0xf9/0x270
__do_softirq+0x1b6/0x39a
? smpboot_register_percpu_thread+0xce/0xce
run_ksoftirqd+0x1d/0x42
smpboot_thread_fn+0x19e/0x1b3
kthread+0xf1/0xf6
? kthread_delayed_work_timer_fn+0x83/0x83
ret_from_fork+0x24/0x30
Fixes: 17926a7932 ("[AF_RXRPC]: Provide secure RxRPC sockets for use by userspace and kernel both")
Reported-by: syzbot+611164843bd48cc2190c@syzkaller.appspotmail.com
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 3a83f677a6 ]
On a 2-socket Power9 system with 32 cores/128 threads (SMT4) and 1TB
of memory running the following guest configs:
guest A:
- 224GB of memory
- 56 VCPUs (sockets=1,cores=28,threads=2), where:
VCPUs 0-1 are pinned to CPUs 0-3,
VCPUs 2-3 are pinned to CPUs 4-7,
...
VCPUs 54-55 are pinned to CPUs 108-111
guest B:
- 4GB of memory
- 4 VCPUs (sockets=1,cores=4,threads=1)
with the following workloads (with KSM and THP enabled in all):
guest A:
stress --cpu 40 --io 20 --vm 20 --vm-bytes 512M
guest B:
stress --cpu 4 --io 4 --vm 4 --vm-bytes 512M
host:
stress --cpu 4 --io 4 --vm 2 --vm-bytes 256M
the below soft-lockup traces were observed after an hour or so and
persisted until the host was reset (this was found to be reliably
reproducible for this configuration, for kernels 4.15, 4.18, 5.0,
and 5.3-rc5):
[ 1253.183290] rcu: INFO: rcu_sched self-detected stall on CPU
[ 1253.183319] rcu: 124-....: (5250 ticks this GP) idle=10a/1/0x4000000000000002 softirq=5408/5408 fqs=1941
[ 1256.287426] watchdog: BUG: soft lockup - CPU#105 stuck for 23s! [CPU 52/KVM:19709]
[ 1264.075773] watchdog: BUG: soft lockup - CPU#24 stuck for 23s! [worker:19913]
[ 1264.079769] watchdog: BUG: soft lockup - CPU#31 stuck for 23s! [worker:20331]
[ 1264.095770] watchdog: BUG: soft lockup - CPU#45 stuck for 23s! [worker:20338]
[ 1264.131773] watchdog: BUG: soft lockup - CPU#64 stuck for 23s! [avocado:19525]
[ 1280.408480] watchdog: BUG: soft lockup - CPU#124 stuck for 22s! [ksmd:791]
[ 1316.198012] rcu: INFO: rcu_sched self-detected stall on CPU
[ 1316.198032] rcu: 124-....: (21003 ticks this GP) idle=10a/1/0x4000000000000002 softirq=5408/5408 fqs=8243
[ 1340.411024] watchdog: BUG: soft lockup - CPU#124 stuck for 22s! [ksmd:791]
[ 1379.212609] rcu: INFO: rcu_sched self-detected stall on CPU
[ 1379.212629] rcu: 124-....: (36756 ticks this GP) idle=10a/1/0x4000000000000002 softirq=5408/5408 fqs=14714
[ 1404.413615] watchdog: BUG: soft lockup - CPU#124 stuck for 22s! [ksmd:791]
[ 1442.227095] rcu: INFO: rcu_sched self-detected stall on CPU
[ 1442.227115] rcu: 124-....: (52509 ticks this GP) idle=10a/1/0x4000000000000002 softirq=5408/5408 fqs=21403
[ 1455.111787] INFO: task worker:19907 blocked for more than 120 seconds.
[ 1455.111822] Tainted: G L 5.3.0-rc5-mdr-vanilla+ #1
[ 1455.111833] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1455.111884] INFO: task worker:19908 blocked for more than 120 seconds.
[ 1455.111905] Tainted: G L 5.3.0-rc5-mdr-vanilla+ #1
[ 1455.111925] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1455.111966] INFO: task worker:20328 blocked for more than 120 seconds.
[ 1455.111986] Tainted: G L 5.3.0-rc5-mdr-vanilla+ #1
[ 1455.111998] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1455.112048] INFO: task worker:20330 blocked for more than 120 seconds.
[ 1455.112068] Tainted: G L 5.3.0-rc5-mdr-vanilla+ #1
[ 1455.112097] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1455.112138] INFO: task worker:20332 blocked for more than 120 seconds.
[ 1455.112159] Tainted: G L 5.3.0-rc5-mdr-vanilla+ #1
[ 1455.112179] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1455.112210] INFO: task worker:20333 blocked for more than 120 seconds.
[ 1455.112231] Tainted: G L 5.3.0-rc5-mdr-vanilla+ #1
[ 1455.112242] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1455.112282] INFO: task worker:20335 blocked for more than 120 seconds.
[ 1455.112303] Tainted: G L 5.3.0-rc5-mdr-vanilla+ #1
[ 1455.112332] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1455.112372] INFO: task worker:20336 blocked for more than 120 seconds.
[ 1455.112392] Tainted: G L 5.3.0-rc5-mdr-vanilla+ #1
CPUs 45, 24, and 124 are stuck on spin locks, likely held by
CPUs 105 and 31.
CPUs 105 and 31 are stuck in smp_call_function_many(), waiting on
target CPU 42. For instance:
# CPU 105 registers (via xmon)
R00 = c00000000020b20c R16 = 00007d1bcd800000
R01 = c00000363eaa7970 R17 = 0000000000000001
R02 = c0000000019b3a00 R18 = 000000000000006b
R03 = 000000000000002a R19 = 00007d537d7aecf0
R04 = 000000000000002a R20 = 60000000000000e0
R05 = 000000000000002a R21 = 0801000000000080
R06 = c0002073fb0caa08 R22 = 0000000000000d60
R07 = c0000000019ddd78 R23 = 0000000000000001
R08 = 000000000000002a R24 = c00000000147a700
R09 = 0000000000000001 R25 = c0002073fb0ca908
R10 = c000008ffeb4e660 R26 = 0000000000000000
R11 = c0002073fb0ca900 R27 = c0000000019e2464
R12 = c000000000050790 R28 = c0000000000812b0
R13 = c000207fff623e00 R29 = c0002073fb0ca808
R14 = 00007d1bbee00000 R30 = c0002073fb0ca800
R15 = 00007d1bcd600000 R31 = 0000000000000800
pc = c00000000020b260 smp_call_function_many+0x3d0/0x460
cfar= c00000000020b270 smp_call_function_many+0x3e0/0x460
lr = c00000000020b20c smp_call_function_many+0x37c/0x460
msr = 900000010288b033 cr = 44024824
ctr = c000000000050790 xer = 0000000000000000 trap = 100
CPU 42 is running normally, doing VCPU work:
# CPU 42 stack trace (via xmon)
[link register ] c00800001be17188 kvmppc_book3s_radix_page_fault+0x90/0x2b0 [kvm_hv]
[c000008ed3343820] c000008ed3343850 (unreliable)
[c000008ed33438d0] c00800001be11b6c kvmppc_book3s_hv_page_fault+0x264/0xe30 [kvm_hv]
[c000008ed33439d0] c00800001be0d7b4 kvmppc_vcpu_run_hv+0x8dc/0xb50 [kvm_hv]
[c000008ed3343ae0] c00800001c10891c kvmppc_vcpu_run+0x34/0x48 [kvm]
[c000008ed3343b00] c00800001c10475c kvm_arch_vcpu_ioctl_run+0x244/0x420 [kvm]
[c000008ed3343b90] c00800001c0f5a78 kvm_vcpu_ioctl+0x470/0x7c8 [kvm]
[c000008ed3343d00] c000000000475450 do_vfs_ioctl+0xe0/0xc70
[c000008ed3343db0] c0000000004760e4 ksys_ioctl+0x104/0x120
[c000008ed3343e00] c000000000476128 sys_ioctl+0x28/0x80
[c000008ed3343e20] c00000000000b388 system_call+0x5c/0x70
--- Exception: c00 (System Call) at 00007d545cfd7694
SP (7d53ff7edf50) is in userspace
It was subsequently found that ipi_message[PPC_MSG_CALL_FUNCTION]
was set for CPU 42 by at least 1 of the CPUs waiting in
smp_call_function_many(), but somehow the corresponding
call_single_queue entries were never processed by CPU 42, causing the
callers to spin in csd_lock_wait() indefinitely.
Nick Piggin suggested something similar to the following sequence as
a possible explanation (interleaving of CALL_FUNCTION/RESCHEDULE
IPI messages seems to be most common, but any mix of CALL_FUNCTION and
!CALL_FUNCTION messages could trigger it):
CPU
X: smp_muxed_ipi_set_message():
X: smp_mb()
X: message[RESCHEDULE] = 1
X: doorbell_global_ipi(42):
X: kvmppc_set_host_ipi(42, 1)
X: ppc_msgsnd_sync()/smp_mb()
X: ppc_msgsnd() -> 42
42: doorbell_exception(): // from CPU X
42: ppc_msgsync()
105: smp_muxed_ipi_set_message():
105: smb_mb()
// STORE DEFERRED DUE TO RE-ORDERING
--105: message[CALL_FUNCTION] = 1
| 105: doorbell_global_ipi(42):
| 105: kvmppc_set_host_ipi(42, 1)
| 42: kvmppc_set_host_ipi(42, 0)
| 42: smp_ipi_demux_relaxed()
| 42: // returns to executing guest
| // RE-ORDERED STORE COMPLETES
->105: message[CALL_FUNCTION] = 1
105: ppc_msgsnd_sync()/smp_mb()
105: ppc_msgsnd() -> 42
42: local_paca->kvm_hstate.host_ipi == 0 // IPI ignored
105: // hangs waiting on 42 to process messages/call_single_queue
This can be prevented with an smp_mb() at the beginning of
kvmppc_set_host_ipi(), such that stores to message[<type>] (or other
state indicated by the host_ipi flag) are ordered vs. the store to
to host_ipi.
However, doing so might still allow for the following scenario (not
yet observed):
CPU
X: smp_muxed_ipi_set_message():
X: smp_mb()
X: message[RESCHEDULE] = 1
X: doorbell_global_ipi(42):
X: kvmppc_set_host_ipi(42, 1)
X: ppc_msgsnd_sync()/smp_mb()
X: ppc_msgsnd() -> 42
42: doorbell_exception(): // from CPU X
42: ppc_msgsync()
// STORE DEFERRED DUE TO RE-ORDERING
-- 42: kvmppc_set_host_ipi(42, 0)
| 42: smp_ipi_demux_relaxed()
| 105: smp_muxed_ipi_set_message():
| 105: smb_mb()
| 105: message[CALL_FUNCTION] = 1
| 105: doorbell_global_ipi(42):
| 105: kvmppc_set_host_ipi(42, 1)
| // RE-ORDERED STORE COMPLETES
-> 42: kvmppc_set_host_ipi(42, 0)
42: // returns to executing guest
105: ppc_msgsnd_sync()/smp_mb()
105: ppc_msgsnd() -> 42
42: local_paca->kvm_hstate.host_ipi == 0 // IPI ignored
105: // hangs waiting on 42 to process messages/call_single_queue
Fixing this scenario would require an smp_mb() *after* clearing
host_ipi flag in kvmppc_set_host_ipi() to order the store vs.
subsequent processing of IPI messages.
To handle both cases, this patch splits kvmppc_set_host_ipi() into
separate set/clear functions, where we execute smp_mb() prior to
setting host_ipi flag, and after clearing host_ipi flag. These
functions pair with each other to synchronize the sender and receiver
sides.
With that change in place the above workload ran for 20 hours without
triggering any lock-ups.
Fixes: 755563bc79 ("powerpc/powernv: Fixes for hypervisor doorbell handling") # v4.0
Signed-off-by: Michael Roth <mdroth@linux.vnet.ibm.com>
Acked-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190911223155.16045-1-mdroth@linux.vnet.ibm.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 3cfa148826 ]
This exercises kernel code path that deal with addresses that have
a limited lifetime.
Without previous fix, this triggers following crash on net-next:
BUG: KASAN: null-ptr-deref in check_lifetime+0x403/0x670
Read of size 8 at addr 0000000000000010 by task kworker [..]
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 934bda59f2 ]
While developing KASAN for 64-bit book3s, I hit the following stack
over-read.
It occurs because the hypercall to put characters onto the terminal
takes 2 longs (128 bits/16 bytes) of characters at a time, and so
hvc_put_chars() would unconditionally copy 16 bytes from the argument
buffer, regardless of supplied length. However, udbg_hvc_putc() can
call hvc_put_chars() with a single-byte buffer, leading to the error.
==================================================================
BUG: KASAN: stack-out-of-bounds in hvc_put_chars+0xdc/0x110
Read of size 8 at addr c0000000023e7a90 by task swapper/0
CPU: 0 PID: 0 Comm: swapper Not tainted 5.2.0-rc2-next-20190528-02824-g048a6ab4835b #113
Call Trace:
dump_stack+0x104/0x154 (unreliable)
print_address_description+0xa0/0x30c
__kasan_report+0x20c/0x224
kasan_report+0x18/0x30
__asan_report_load8_noabort+0x24/0x40
hvc_put_chars+0xdc/0x110
hvterm_raw_put_chars+0x9c/0x110
udbg_hvc_putc+0x154/0x200
udbg_write+0xf0/0x240
console_unlock+0x868/0xd30
register_console+0x970/0xe90
register_early_udbg_console+0xf8/0x114
setup_arch+0x108/0x790
start_kernel+0x104/0x784
start_here_common+0x1c/0x534
Memory state around the buggy address:
c0000000023e7980: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
c0000000023e7a00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 f1 f1
>c0000000023e7a80: f1 f1 01 f2 f2 f2 00 00 00 00 00 00 00 00 00 00
^
c0000000023e7b00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
c0000000023e7b80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
==================================================================
Document that a 16-byte buffer is requred, and provide it in udbg.
Signed-off-by: Daniel Axtens <dja@axtens.net>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit cba22d86e0 upstream.
Currently, block device size in not updated on second and further open
for block devices where partition scan is disabled. This is particularly
annoying for example for DVD drives as that means block device size does
not get updated once the media is inserted into a drive if the device is
already open when inserting the media. This is actually always the case
for example when pktcdvd is in use.
Fix the problem by revalidating block device size on every open even for
devices with partition scan disabled.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 731dc48683 upstream.
Factor out code handling revalidation of bdev on disk change into a
common helper.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>