It would be tempting to turn the 'pause' state into a flag.
However, this cannot easily be done as it is updated out of context,
while all the flags expect to only be updated from the vcpu thread.
Turning it into a flag would require to make all flag updates
atomic, which isn't necessary desireable.
Document this, and take this opportunity to move the field next
to the flag sets, filling a hole in the vcpu structure.
Reviewed-by: Fuad Tabba <tabba@google.com>
Reviewed-by: Reiji Watanabe <reijiw@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
(cherry picked from commit 0fa4a3137e)
Signed-off-by: Will Deacon <willdeacon@google.com>
Bug: 233587962
Bug: 233588291
Change-Id: Icbe091a7af25cb3db7acab281e830cd1472b3821
Now that we can detect flags overflowing their container, reduce
the size of all flag set members in the vcpu struct, turning them
into 8bit quantities.
Even with the FP state enum occupying 32bit, the whole of the state
that was represented by flags is smaller by one byte. Profit!
Reviewed-by: Fuad Tabba <tabba@google.com>
Reviewed-by: Reiji Watanabe <reijiw@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
(cherry picked from commit 54ddda919c)
Signed-off-by: Will Deacon <willdeacon@google.com>
Bug: 233587962
Bug: 233588291
Change-Id: I3ccd004f5507a72229d4afefbac98abc486d845c
Flags are great, but flags can also be dangerous: it is easy
to encode a flag that is bigger than its container (unless the
container is a u64), and it is easy to construct a flag value
that doesn't fit in the mask that is associated with it.
Add a couple of build-time sanity checks that ensure we catch
these two cases.
Reviewed-by: Fuad Tabba <tabba@google.com>
Reviewed-by: Reiji Watanabe <reijiw@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
(cherry picked from commit 5a3984f4ec)
Signed-off-by: Will Deacon <willdeacon@google.com>
Bug: 233587962
Bug: 233588291
Change-Id: Id8a9f9f07192b69527ac74ddf0815a869cece648
We really don't want PENDING_EXCEPTION and INCREMENT_PC to ever be
set at the same time, as they are mutually exclusive. Add checks
that will generate a warning should this ever happen.
Reviewed-by: Fuad Tabba <tabba@google.com>
Reviewed-by: Reiji Watanabe <reijiw@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
(cherry picked from commit e19f2c6cd1)
Signed-off-by: Will Deacon <willdeacon@google.com>
Bug: 233587962
Bug: 233588291
Change-Id: I20efdbeafc1b641f7827f5b3d24870f3e1aff4fa
The aptly named boolean 'sysregs_loaded_on_cpu' tracks whether
some of the vcpu system registers are resident on the physical
CPU when running in VHE mode.
This is obviously a flag in hidding, so let's convert it to
a state flag, since this is solely a host concern (the hypervisor
itself always knows which state we're in).
Reviewed-by: Fuad Tabba <tabba@google.com>
Reviewed-by: Reiji Watanabe <reijiw@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
(cherry picked from commit 30b6ab45f8)
Signed-off-by: Will Deacon <willdeacon@google.com>
Bug: 233587962
Bug: 233588291
Change-Id: Ia08beccb869e89a3cce0f46d4a08b3e1ee474eab
Horray, we have now sorted all the preexisting flags, and the
'flags' field is now unused. Get rid of it while nobody is
looking.
Reviewed-by: Fuad Tabba <tabba@google.com>
Reviewed-by: Reiji Watanabe <reijiw@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
(cherry picked from commit 781e3ae148)
Signed-off-by: Will Deacon <willdeacon@google.com>
Bug: 233587962
Bug: 233588291
Change-Id: Iad0259759a02f3823d88f0dfdbb30d9f87b2113e
The host kernel uses the WFIT flag to remember that a vcpu has used
this instruction and wake it up as required. Move it to the state
set, as nothing in the hypervisor uses this information.
Reviewed-by: Fuad Tabba <tabba@google.com>
Reviewed-by: Reiji Watanabe <reijiw@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
(cherry picked from commit eebc538d8e)
[willdeacon@: Use kvm_vcpu_block() instead of kvm_vcpu_halt() in context]
Signed-off-by: Will Deacon <willdeacon@google.com>
Bug: 233587962
Bug: 233588291
Change-Id: Iccd6cca64cfb3920be878fa883255e3d6ba2a01c
The ON_UNSUPPORTED_CPU flag is only there to track the sad fact
that we have ended-up on a CPU where we cannot really run.
Since this is only for the host kernel's use, move it to the state
set.
Reviewed-by: Fuad Tabba <tabba@google.com>
Reviewed-by: Reiji Watanabe <reijiw@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
(cherry picked from commit aff3ccd732)
Signed-off-by: Will Deacon <willdeacon@google.com>
Bug: 233587962
Bug: 233588291
Change-Id: Ida322ae52e86bbf433641cae809ef70b256e539e
The two HOST_{SVE,SME}_ENABLED are only used for the host kernel
to track its own state across a vcpu run so that it can be fully
restored.
Move these flags to the so called state set.
Reviewed-by: Fuad Tabba <tabba@google.com>
Reviewed-by: Reiji Watanabe <reijiw@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
(cherry picked from commit 0affa37fcd)
Signed-off-by: Will Deacon <willdeacon@google.com>
Bug: 233587962
Bug: 233588291
Change-Id: Id09e1ee6ddb524f8cbcde73c76f89977ad421e7d
The three debug flags (which deal with the debug registers, SPE and
TRBE) all are input flags to the hypervisor code.
Move them into the input set and convert them to the new accessors.
Reviewed-by: Fuad Tabba <tabba@google.com>
Reviewed-by: Reiji Watanabe <reijiw@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
(cherry picked from commit b1da49088a)
Signed-off-by: Will Deacon <willdeacon@google.com>
Bug: 233587962
Bug: 233588291
Change-Id: Iefae8aeddfbe62070848305ebce44a9120beda6e
Copy the task argument passed to arch_stack_walk() to unwind_state so that
it can be passed to unwind functions via unwind_state rather than as a
separate argument. The task is a fundamental part of the unwind state.
Signed-off-by: Madhavan T. Venkataraman <madvenka@linux.microsoft.com>
Reviewed-by: Mark Brown <broonie@kernel.org>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Link: https://lore.kernel.org/r/20220617180219.20352-3-madvenka@linux.microsoft.com
Signed-off-by: Will Deacon <will@kernel.org>
(cherry picked from commit 82a592c13b)
[willdeacon@: Resolve context conflict with kretprobes field in
'struct unwind_state']
Signed-off-by: Will Deacon <willdeacon@google.com>
Bug: 233587962
Bug: 233588291
Change-Id: Icf29c7da8111dd5f7a25d99e6277a99c78e1ec23
unwind_init() is currently a single function that initializes all of the
unwind state. Split it into the following functions and call them
appropriately:
- unwind_init_from_regs() - initialize from regs passed by caller.
- unwind_init_from_caller() - initialize for the current task
from the caller of arch_stack_walk().
- unwind_init_from_task() - initialize from the saved state of a
task other than the current task. In this case, the other
task must not be running.
This is done for two reasons:
- the different ways of initializing are clear
- specialized code can be added to each initializer in the future.
Signed-off-by: Madhavan T. Venkataraman <madvenka@linux.microsoft.com>
Reviewed-by: Mark Brown <broonie@kernel.org>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Link: https://lore.kernel.org/r/20220617180219.20352-2-madvenka@linux.microsoft.com
Signed-off-by: Will Deacon <will@kernel.org>
(cherry picked from commit a019d8a2cc)
[willdeacon@: Resolve context conflict with kretprobes code in unwind_init()]
Signed-off-by: Will Deacon <willdeacon@google.com>
Bug: 233587962
Bug: 233588291
Change-Id: I35f738d9cd8ab5f6fa333ac451a32f2486734da4
Disable KASAN instrumentation of arch/arm64/kernel/stacktrace.c.
This speeds up Generic KASAN by 5-20%.
As a side-effect, KASAN is now unable to detect bugs in the stack trace
collection code. This is taken as an acceptable downside.
Also replace READ_ONCE_NOCHECK() with READ_ONCE() in stacktrace.c.
As the file is now not instrumented, there is no need to use the
NOCHECK version of READ_ONCE().
Suggested-by: Mark Rutland <mark.rutland@arm.com>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Link: https://lore.kernel.org/r/c4c944a2a905e949760fbeb29258185087171708.1653317461.git.andreyknvl@google.com
Signed-off-by: Will Deacon <will@kernel.org>
(cherry picked from commit 802b91118d)
Signed-off-by: Will Deacon <willdeacon@google.com>
Bug: 233587962
Bug: 233588291
Change-Id: I374277671bb324b494fcf661851aeb5a8a1769c4
For historical reasons, the naming of parameters and their types in the
arm64 stacktrace code differs from that used in generic code and other
architectures, even though the types are equivalent.
For consistency and clarity, use the generic names.
There should be no functional change as a result of this patch.
Signed-off-by: Madhavan T. Venkataraman <madvenka@linux.microsoft.com>
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Mark Brown <broonie@kernel.org>
Reviewed-by: Kalesh Singh <kaleshsingh@google.com> for the series.
Link: https://lore.kernel.org/r/20220413145910.3060139-7-mark.rutland@arm.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
(cherry picked from commit bd5552bc48)
Signed-off-by: Will Deacon <willdeacon@google.com>
Bug: 233587962
Bug: 233588291
Change-Id: I1924f6b77bbe7f8fff9ee145d41364f4cb313948
Now that open-coded stack unwinds have been converted to
arch_stack_walk(), we no longer need to expose any of unwind_frame(),
walk_stackframe(), or start_backtrace() outside of stacktrace.c.
Make those functions private to stacktrace.c, removing their prototypes
from <asm/stacktrace.h> and marking them static.
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Mark Brown <broonie@kernel.org>
Cc: Madhavan T. Venkataraman <madvenka@linux.microsoft.com>
Link: https://lore.kernel.org/r/20211129142849.3056714-10-mark.rutland@arm.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
(cherry picked from commit d2d1d2645c)
[willdeacon@: Retain 'notrace' annotation on start_backtrace() from LTS merge]
Signed-off-by: Will Deacon <willdeacon@google.com>
Bug: 233587962
Bug: 233588291
Change-Id: I3e3288fed5060dd457ecbea2ccad15af105a1224
This reverts commit b7ca6bc390.
Upstream has made this symbol static and subsequently reworked the
unwinder logic, so stop exporting start_backtrace() so that we can
backport the rest of the upstream changes.
Signed-off-by: Will Deacon <willdeacon@google.com>
Bug: 233587962
Bug: 233588291
Change-Id: I3d64e5ceb2da41db1a47a5f78a412a22e5d75f12
To enable RELIABLE_STACKTRACE and LIVEPATCH on arm64, we need to
substantially rework arm64's unwinding code. As part of this, we want to
minimize the set of unwind interfaces we expose, and avoid open-coding
of unwind logic.
Currently, dump_backtrace() walks the stack of the current task or a
blocked task by calling stact_backtrace() and iterating unwind steps
using unwind_frame(). This can be written more simply in terms of
arch_stack_walk(), considering three distinct cases:
1) When unwinding a blocked task, start_backtrace() is called with the
blocked task's saved PC and FP, and the unwind proceeds immediately
from this point without skipping any entries. This is functionally
equivalent to calling arch_stack_walk() with the blocked task, which
will start with the task's saved PC and FP.
There is no functional change to this case.
2) When unwinding the current task without regs, start_backtrace() is
called with dump_backtrace() as the PC and __builtin_frame_address(0)
as the next frame, and the unwind proceeds immediately without
skipping. This is *almost* functionally equivalent to calling
arch_stack_walk() for the current task, which will start with its
caller (i.e. an offset into dump_backtrace()) as the PC, and the
callers frame record as the next frame.
The only difference being that dump_backtrace() will be reported with
an offset (which is strictly more correct than currently). Otherwise
there is no functional cahnge to this case.
3) When unwinding the current task with regs, start_backtrace() is
called with dump_backtrace() as the PC and __builtin_frame_address(0)
as the next frame, and the unwind is performed silently until the
next frame is the frame pointed to by regs->fp. Reporting starts
from regs->pc and continues from the frame in regs->fp.
Historically, this pre-unwind was necessary to correctly record
return addresses rewritten by the ftrace graph calller, but this is
no longer necessary as these are now recovered using the FP since
commit:
c6d3cd32fd ("arm64: ftrace: use HAVE_FUNCTION_GRAPH_RET_ADDR_PTR")
This pre-unwind is not necessary to recover return addresses
rewritten by kretprobes, which historically were not recovered, and
are now recovered using the FP since commit:
cd9bc2c925 ("arm64: Recover kretprobe modified return address in stacktrace")
Thus, this is functionally equivalent to calling arch_stack_walk()
with the current task and regs, which will start with regs->pc as the
PC and regs->fp as the next frame, without a pre-unwind.
This patch makes dump_backtrace() use arch_stack_walk(). This simplifies
dump_backtrace() and will permit subsequent changes to the unwind code.
Aside from the improved reporting when unwinding current without regs,
there should be no functional change as a result of this patch.
Signed-off-by: Madhavan T. Venkataraman <madvenka@linux.microsoft.com>
[Mark: elaborate commit message]
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Mark Brown <broonie@kernel.org>
Link: https://lore.kernel.org/r/20211129142849.3056714-9-mark.rutland@arm.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
(cherry picked from commit 2dad6dc17b)
[willdeacon@: NOTE: We don't have cd9bc2c925 and backporting its
dependencies is not viable so kretprobe unwinding remains broken even
after this patch]
Signed-off-by: Will Deacon <willdeacon@google.com>
Bug: 233587962
Bug: 233588291
Change-Id: I04ca152601bd64f0e46a606af2aa5b35c86b5bf3
To enable RELIABLE_STACKTRACE and LIVEPATCH on arm64, we need to
substantially rework arm64's unwinding code. As part of this, we want to
minimize the set of unwind interfaces we expose, and avoid open-coding
of unwind logic outside of stacktrace.c.
Currently profile_pc() walks the stack of an interrupted context by
calling start_backtrace() with the context's PC and FP, and iterating
unwind steps using walk_stackframe(). This is functionally equivalent to
calling arch_stack_walk() with the interrupted context's pt_regs, which
will start with the PC and FP from the regs.
Make profile_pc() use arch_stack_walk(). This simplifies profile_pc(),
and in future will alow us to make walk_stackframe() private to
stacktrace.c.
At the same time, we remove the early return for when regs->pc is not in
lock functions, as this will be handled by the first call to the
profile_pc_cb() callback.
There should be no functional change as a result of this patch.
Signed-off-by: Madhavan T. Venkataraman <madvenka@linux.microsoft.com>
Reviewed-by: Mark Rutland <mark.rutland@arm.com>
[Mark: remove early return, elaborate commit message, fix includes]
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Mark Brown <broonie@kernel.org>
Link: https://lore.kernel.org/r/20211129142849.3056714-8-mark.rutland@arm.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
(cherry picked from commit 22ecd975b6)
Signed-off-by: Will Deacon <willdeacon@google.com>
Bug: 233587962
Bug: 233588291
Change-Id: I65bdcde177048003f443c9280520bd1924b09491
To enable RELIABLE_STACKTRACE and LIVEPATCH on arm64, we need to
substantially rework arm64's unwinding code. As part of this, we want to
minimize the set of unwind interfaces we expose, and avoid open-coding
of unwind logic outside of stacktrace.c.
Currently return_address() walks the stack of the current task by
calling start_backtrace() with return_address as the PC and the frame
pointer of return_address() as the next frame, iterating unwind steps
using walk_stackframe(). This is functionally equivalent to calling
arch_stack_walk() for the current stack, which will start from its
caller (i.e. return_address()) as the PC and it's caller's frame record
as the next frame.
Make return_address() use arch_stackwalk(). This simplifies
return_address(), and in future will alow us to make walk_stackframe()
private to stacktrace.c.
There should be no functional change as a result of this patch.
Signed-off-by: Madhavan T. Venkataraman <madvenka@linux.microsoft.com>
Tested-by: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Mark Brown <broonie@kernel.org>
Reviewed-by: Mark Rutland <mark.rutland@arm.com>
[Mark: elaborate commit message, fix includes]
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Link: https://lore.kernel.org/r/20211129142849.3056714-7-mark.rutland@arm.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
(cherry picked from commit 39ef362d2d)
Signed-off-by: Will Deacon <willdeacon@google.com>
Bug: 233587962
Bug: 233588291
Change-Id: I2d5d45c0d583a662910018c12035effbdc7cb6b9
To enable RELIABLE_STACKTRACE and LIVEPATCH on arm64, we need to
substantially rework arm64's unwinding code. As part of this, we want to
minimize the set of unwind interfaces we expose, and avoid open-coding
of unwind logic outside of stacktrace.c.
Currently, __get_wchan() walks the stack of a blocked task by calling
start_backtrace() with the task's saved PC and FP values, and iterating
unwind steps using unwind_frame(). The initialization is functionally
equivalent to calling arch_stack_walk() with the blocked task, which
will start with the task's saved PC and FP values.
Currently __get_wchan() always performs an initial unwind step, which
will stkip __switch_to(), but as this is now marked as a __sched
function, this no longer needs special handling and will be skipped in
the same way as other sched functions.
Make __get_wchan() use arch_stack_walk(). This simplifies __get_wchan(),
and in future will alow us to make unwind_frame() private to
stacktrace.c. At the same time, we can simplify the try_get_task_stack()
check and avoid the unnecessary `stack_page` variable.
The change to the skipping logic means we may terminate one frame
earlier than previously where there are an excessive number of sched
functions in the trace, but this isn't seen in practice, and wchan is
best-effort anyway, so this should not be a problem.
Other than the above, there should be no functional change as a result
of this patch.
Signed-off-by: Madhavan T. Venkataraman <madvenka@linux.microsoft.com>
[Mark: rebase atop wchan changes, elaborate commit message, fix includes]
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Mark Brown <broonie@kernel.org>
Link: https://lore.kernel.org/r/20211129142849.3056714-6-mark.rutland@arm.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
(cherry picked from commit 4f62bb7cb1)
[willdeacon@: Fix context conflict with task checks in get_wchan()]
Signed-off-by: Will Deacon <willdeacon@google.com>
Bug: 233587962
Bug: 233588291
Change-Id: Iad86e4fcf28e88df2cca65767f984d351b34d3ed
To enable RELIABLE_STACKTRACE and LIVEPATCH on arm64, we need to
substantially rework arm64's unwinding code. As part of this, we want to
minimize the set of unwind interfaces we expose, and avoid open-coding
of unwind logic outside of stacktrace.c.
Currently perf_callchain_kernel() walks the stack of an interrupted
context by calling start_backtrace() with the context's PC and FP, and
iterating unwind steps using walk_stackframe(). This is functionally
equivalent to calling arch_stack_walk() with the interrupted context's
pt_regs, which will start with the PC and FP from the regs.
Make perf_callchain_kernel() use arch_stack_walk(). This simplifies
perf_callchain_kernel(), and in future will alow us to make
walk_stackframe() private to stacktrace.c.
At the same time, we update the callchain_trace() callback to check the
return value of perf_callchain_store(), which indicates whether there is
space for any further entries. When a non-zero value is returned,
further calls will be ignored, and are redundant, so we can stop the
unwind at this point.
We also remove the stale and confusing comment for callchain_trace.
There should be no functional change as a result of this patch.
Signed-off-by: Madhavan T. Venkataraman <madvenka@linux.microsoft.com>
Tested-by: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Mark Brown <broonie@kernel.org>
Reviewed-by: Mark Rutland <mark.rutland@arm.com>
[Mark: elaborate commit message, remove comment, fix includes]
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Link: https://lore.kernel.org/r/20211129142849.3056714-5-mark.rutland@arm.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
(cherry picked from commit ed876d35a1)
[willdeacon@: Fix context conflict with 'perf_guest_cbs' in perf_callchain_kernel()]
Signed-off-by: Will Deacon <willdeacon@google.com>
Bug: 233587962
Bug: 233588291
Change-Id: Ifb2003732409684f02fcf5fc221c26df57295ed2
Unlike most architectures (and only in keeping with powerpc), arm64 has
a non __sched() function on the path to our cpu_switch_to() assembly
function.
It is expected that for a blocked task, in_sched_functions() can be used
to skip all functions between the raw context switch assembly and the
scheduler functions that call into __switch_to(). This is the behaviour
expected by stack_trace_consume_entry_nosched(), and the behaviour we'd
like to have such that we an simplify arm64's __get_wchan()
implementation to use arch_stack_walk().
This patch mark's arm64's __switch_to as __sched. This *will not* change
the behaviour of arm64's current __get_wchan() implementation, which
always performs an initial unwind step which skips __switch_to(). This
*will* change the behaviour of stack_trace_consume_entry_nosched() and
stack_trace_save_tsk() to match their expected behaviour on blocked
tasks, skipping all scheduler-internal functions including
__switch_to().
Other than the above, there should be no functional change as a result
of this patch.
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Madhavan T. Venkataraman <madvenka@linux.microsoft.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will@kernel.org>
Reviewed-by: Mark Brown <broonie@kernel.org>
Link: https://lore.kernel.org/r/20211129142849.3056714-4-mark.rutland@arm.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
(cherry picked from commit 86bcbafcb7)
Signed-off-by: Will Deacon <willdeacon@google.com>
Bug: 233587962
Bug: 233588291
Change-Id: I518cfec11d8f661c9fa0bbe23bec4e465599bfe8
When CONFIG_FUNCTION_GRAPH_TRACER is selected and the function graph
tracer is in use, unwind_frame() may erroneously associate a traced
function with an incorrect return address. This can happen when starting
an unwind from a pt_regs, or when unwinding across an exception
boundary.
This can be seen when recording with perf while the function graph
tracer is in use. For example:
| # echo function_graph > /sys/kernel/debug/tracing/current_tracer
| # perf record -g -e raw_syscalls:sys_enter:k /bin/true
| # perf report
... reports the callchain erroneously as:
| el0t_64_sync
| el0t_64_sync_handler
| el0_svc_common.constprop.0
| perf_callchain
| get_perf_callchain
| syscall_trace_enter
| syscall_trace_enter
... whereas when the function graph tracer is not in use, it reports:
| el0t_64_sync
| el0t_64_sync_handler
| el0_svc
| do_el0_svc
| el0_svc_common.constprop.0
| syscall_trace_enter
| syscall_trace_enter
The underlying problem is that ftrace_graph_get_ret_stack() takes an
index offset from the most recent entry added to the fgraph return
stack. We start an unwind at offset 0, and increment the offset each
time we encounter a rewritten return address (i.e. when we see
`return_to_handler`). This is broken in two cases:
1) Between creating a pt_regs and starting the unwind, function calls
may place entries on the stack, leaving an arbitrary offset which we
can only determine by performing a full unwind from the caller of the
unwind code (and relying on none of the unwind code being
instrumented).
This can result in erroneous entries being reported in a backtrace
recorded by perf or kfence when the function graph tracer is in use.
Currently show_regs() is unaffected as dump_backtrace() performs an
initial unwind.
2) When unwinding across an exception boundary (whether continuing an
unwind or starting a new unwind from regs), we currently always skip
the LR of the interrupted context. Where this was live and contained
a rewritten address, we won't consume the corresponding fgraph ret
stack entry, leaving subsequent entries off-by-one.
This can result in erroneous entries being reported in a backtrace
performed by any in-kernel unwinder when that backtrace crosses an
exception boundary, with entries after the boundary being reported
incorrectly. This includes perf, kfence, show_regs(), panic(), etc.
To fix this, we need to be able to uniquely identify each rewritten
return address such that we can map this back to the original return
address. We can use HAVE_FUNCTION_GRAPH_RET_ADDR_PTR to associate
each rewritten return address with a unique location on the stack. As
the return address is passed in the LR (and so is not guaranteed a
unique location in memory), we use the FP upon entry to the function
(i.e. the address of the caller's frame record) as the return address
pointer. Any nested call will have a different FP value as the caller
must create its own frame record and update FP to point to this.
Since ftrace_graph_ret_addr() requires the return address with the PAC
stripped, the stripping of the PAC is moved before the fixup of the
rewritten address. As we would unconditionally strip the PAC, moving
this earlier is not harmful, and we can avoid a redundant strip in the
return address fixup code.
I've tested this with the perf case above, the ftrace selftests, and
a number of ad-hoc unwinder tests. The tests all pass, and I have seen
no unexpected behaviour as a result of this change. I've tested with
pointer authentication under QEMU TCG where magic-sysrq+l correctly
recovers the original return addresses.
Note that this doesn't fix the issue of skipping a live LR at an
exception boundary, which is a more general problem and requires more
substantial rework. Were we to consume the LR in all cases this would
result in warnings where the interrupted context's LR contains
`return_to_handler`, but the FP has been altered, e.g.
| func:
| <--- ftrace entry ---> // logs FP & LR, rewrites LR
| STP FP, LR, [SP, #-16]!
| MOV FP, SP
| <--- INTERRUPT --->
... as ftrace_graph_get_ret_stack() fill not find a matching entry,
triggering the WARN_ON_ONCE() in unwind_frame().
Link: https://lore.kernel.org/r/20211025164925.GB2001@C02TD0UTHF1T.local
Link: https://lore.kernel.org/r/20211027132529.30027-1-mark.rutland@arm.com
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Madhavan T. Venkataraman <madvenka@linux.microsoft.com>
Cc: Mark Brown <broonie@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Will Deacon <will@kernel.org>
Reviewed-by: Mark Brown <broonie@kernel.org>
Link: https://lore.kernel.org/r/20211029162245.39761-1-mark.rutland@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
(cherry picked from commit c6d3cd32fd)
[willdeacon@: Fixed conflicts due to lack of kretprobes unwinder support]
Signed-off-by: Will Deacon <willdeacon@google.com>
Bug: 233587962
Bug: 233588291
Change-Id: I70564a777f1424e6906be646ed09a1e4bb502e2e
The PC update flags (which also deal with exception injection)
is one of the most complicated use of the flag we have. Make it
more fool prof by:
- moving it over to the new accessors and assign it to the
input flag set
- turn the combination of generic ELx flags with another flag
indicating the target EL itself into an explicit set of
flags for each EL and vector combination
- add a new accessor to pend the exception
This is otherwise a pretty straightformward conversion.
Reviewed-by: Fuad Tabba <tabba@google.com>
Reviewed-by: Reiji Watanabe <reijiw@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
(cherry picked from commit 699bb2e0c6)
Signed-off-by: Will Deacon <willdeacon@google.com>
Bug: 233587962
Bug: 233588291
Change-Id: Ida64b52f942e8c174c56965acad7f07e63318b3a
The KVM_ARM64_{GUEST_HAS_SVE,VCPU_SVE_FINALIZED,GUEST_HAS_PTRAUTH}
flags are purely configuration flags. Once set, they are never cleared,
but evaluated all over the code base.
Move these three flags into the configuration set in one go, using
the new accessors, and take this opportunity to drop the KVM_ARM64_
prefix which doesn't provide any help.
Reviewed-by: Fuad Tabba <tabba@google.com>
Reviewed-by: Reiji Watanabe <reijiw@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
(cherry picked from commit 4c0680d394)
Signed-off-by: Will Deacon <willdeacon@google.com>
Bug: 233587962
Bug: 233588291
Change-Id: Ic4ac6ee4b2791e58eb1a2b9b00e4b2fa15e00f8f
It so appears that each of the vcpu flags is really belonging to
one of three categories:
- a configuration flag, set once and for all
- an input flag generated by the kernel for the hypervisor to use
- a state flag that is only for the kernel's own bookkeeping
As we are going to split all the existing flags into these three
sets, introduce all three in one go.
No functional change other than a bit of bloat...
Reviewed-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
(cherry picked from commit 690bacb83b)
Signed-off-by: Will Deacon <willdeacon@google.com>
Bug: 233587962
Bug: 233588291
Change-Id: I09747e3eb548f4f28383765126415759631b208e
Careful analysis of the vcpu flags show that this is a mix of
configuration, communication between the host and the hypervisor,
as well as anciliary state that has no consistency. It'd be a lot
better if we could split these flags into consistent categories.
However, even if we split these flags apart, we want to make sure
that each flag can only be applied to its own set, and not across
sets.
To achieve this, use a preprocessor hack so that each flag is always
associated with:
- the set that contains it,
- a mask that describe all the bits that contain it (for a simple
flag, this is the same thing as the flag itself, but we will
eventually have values that cover multiple bits at once).
Each flag is thus a triplet that is not directly usable as a value,
but used by three helpers that allow the flag to be set, cleared,
and fetched. By mandating the use of such helper, we can easily
enforce that a flag can only be used with the set it belongs to.
Finally, one last helper "unpacks" the raw value from the triplet
that represents a flag, which is useful for multi-bit values that
need to be enumerated (in a switch statement, for example).
Further patches will start making use of this infrastructure.
Signed-off-by: Marc Zyngier <maz@kernel.org>
(cherry picked from commit e87abb73e5)
Signed-off-by: Will Deacon <willdeacon@google.com>
Bug: 233587962
Bug: 233588291
Change-Id: I8d13665f55f3f8a589d4b4c346e9c1eca2716ba1
The KVM FP code uses a pair of flags to denote three states:
- FP_ENABLED set: the guest owns the FP state
- FP_HOST set: the host owns the FP state
- FP_ENABLED and FP_HOST clear: nobody owns the FP state at all
and both flags set is an illegal state, which nothing ever checks
for...
As it turns out, this isn't really a good match for flags, and
we'd be better off if this was a simpler tristate, each state
having a name that actually reflect the state:
- FP_STATE_FREE
- FP_STATE_HOST_OWNED
- FP_STATE_GUEST_OWNED
Kill the two flags, and move over to an enum encoding these
three states. This results in less confusing code, and less risk of
ending up in the uncharted territory of a 4th state if we forget
to clear one of the two flags.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Mark Brown <broonie@kernel.org>
Reviewed-by: Reiji Watanabe <reijiw@google.com>
(cherry picked from commit f8077b0d59)
Signed-off-by: Will Deacon <willdeacon@google.com>
Bug: 233587962
Bug: 233588291
Change-Id: I4623ccea6aaa9d0cb3a9dc3502483c9628a67ca6
The vcpu KVM_ARM64_FP_FOREIGN_FPSTATE flag tracks the thread's own
TIF_FOREIGN_FPSTATE so that we can evaluate just before running
the vcpu whether it the FP regs contain something that is owned
by the vcpu or not by updating the rest of the FP flags.
We do this in the hypervisor code in order to make sure we're
in a context where we are not interruptible. But we already
have a hook in the run loop to generate this flag. We may as
well update the FP flags directly and save the pointless flag
tracking.
Whilst we're at it, rename update_fp_enabled() to guest_owns_fp_regs()
to indicate what the leftover of this helper actually do.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Reiji Watanabe <reijiw@google.com>
Reviewed-by: Mark Brown <broonie@kernel.org>
(cherry picked from commit e9ada6c208)
Signed-off-by: Will Deacon <willdeacon@google.com>
Bug: 233587962
Bug: 233588291
Change-Id: I54be51528b03e75acc105d94bd8495a54ff07695
The defines for SVCR call it SVCR_EL0 however the architecture calls the
register SVCR with no _EL0 suffix. In preparation for generating the sysreg
definitions rename to match the architecture, no functional change.
Signed-off-by: Mark Brown <broonie@kernel.org>
Link: https://lore.kernel.org/r/20220510161208.631259-6-broonie@kernel.org
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
(cherry picked from commit ec0067a63e)
Signed-off-by: Will Deacon <willdeacon@google.com>
Bug: 233587962
Bug: 233588291
Change-Id: I7cbea5d69d9de30a48211c5822fceac89df7aa84
The SVE and SVE length configuration field LEN have constants specifying
their width called _SIZE rather than the more normal _WIDTH, in preparation
for automatic generation rename to _WIDTH. No functional change.
Signed-off-by: Mark Brown <broonie@kernel.org>
Link: https://lore.kernel.org/r/20220510161208.631259-3-broonie@kernel.org
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
(cherry picked from commit 5b06dcfd9e)
Signed-off-by: Will Deacon <willdeacon@google.com>
Bug: 233587962
Bug: 233588291
Change-Id: Iffa948078533f60fd109abb6bc2c3ad9cc9e0a0e
Currently (as of DDI0487H.a) the architecture defines the vector length
control field in ZCR and SMCR as being 4 bits wide with an additional 5
bits reserved above it marked as RAZ/WI for future expansion. The kernel
currently attempts to anticipate such expansion by treating these extra
bits as part of the LEN field but this will be inconvenient when we start
generating the defines and would cause problems in the event that the
architecture goes a different direction with these fields. Let's instead
change the defines to reflect the currently defined architecture, we can
update in future as needed.
No change in behaviour should be seen in any system, even emulated systems
using the maximum allowed vector length for the current architecture.
Signed-off-by: Mark Brown <broonie@kernel.org>
Link: https://lore.kernel.org/r/20220510161208.631259-2-broonie@kernel.org
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
(cherry picked from commit f171f9e409)
Signed-off-by: Will Deacon <willdeacon@google.com>
Bug: 233587962
Bug: 233588291
Change-Id: Ie06a96257be4de145729808da2ba0a5180fcf6b9
Changes in 5.15.71
drm/amdgpu: Separate vf2pf work item init from virt data exchange
drm/amdgpu: make sure to init common IP before gmc
staging: r8188eu: Remove support for devices with 8188FU chipset (0bda:f179)
staging: r8188eu: Add Rosewill USB-N150 Nano to device tables
usb: dwc3: gadget: Avoid starting DWC3 gadget during UDC unbind
usb: dwc3: Issue core soft reset before enabling run/stop
usb: dwc3: gadget: Prevent repeat pullup()
usb: dwc3: gadget: Refactor pullup()
usb: dwc3: gadget: Don't modify GEVNTCOUNT in pullup()
usb: dwc3: gadget: Avoid duplicate requests to enable Run/Stop
usb: add quirks for Lenovo OneLink+ Dock
usb: gadget: udc-xilinx: replace memcpy with memcpy_toio
Revert "usb: add quirks for Lenovo OneLink+ Dock"
Revert "usb: gadget: udc-xilinx: replace memcpy with memcpy_toio"
drivers/base: Fix unsigned comparison to -1 in CPUMAP_FILE_MAX_BYTES
USB: core: Fix RST error in hub.c
USB: serial: option: add Quectel BG95 0x0203 composition
USB: serial: option: add Quectel RM520N
Revert "ALSA: usb-audio: Split endpoint setups for hw_params and prepare"
ALSA: core: Fix double-free at snd_card_new()
ALSA: hda/tegra: set depop delay for tegra
ALSA: hda: add Intel 5 Series / 3400 PCI DID
ALSA: hda/realtek: Add quirk for Huawei WRT-WX9
ALSA: hda/realtek: Enable 4-speaker output Dell Precision 5570 laptop
ALSA: hda/realtek: Re-arrange quirk table entries
ALSA: hda/realtek: Add pincfg for ASUS G513 HP jack
ALSA: hda/realtek: Add pincfg for ASUS G533Z HP jack
ALSA: hda/realtek: Add quirk for ASUS GA503R laptop
ALSA: hda/realtek: Enable 4-speaker output Dell Precision 5530 laptop
iommu/vt-d: Check correct capability for sagaw determination
btrfs: fix hang during unmount when stopping block group reclaim worker
btrfs: fix hang during unmount when stopping a space reclaim worker
media: flexcop-usb: fix endpoint type check
usb: dwc3: core: leave default DMA if the controller does not support 64-bit DMA
thunderbolt: Add support for Intel Maple Ridge single port controller
efi: x86: Wipe setup_data on pure EFI boot
efi: libstub: check Shim mode using MokSBStateRT
wifi: mt76: fix reading current per-tid starting sequence number for aggregation
gpio: mockup: fix NULL pointer dereference when removing debugfs
gpio: mockup: Fix potential resource leakage when register a chip
gpiolib: cdev: Set lineevent_state::irq after IRQ register successfully
riscv: fix a nasty sigreturn bug...
kasan: call kasan_malloc() from __kmalloc_*track_caller()
can: flexcan: flexcan_mailbox_read() fix return value for drop = true
net: mana: Add rmb after checking owner bits
mm/slub: fix to return errno if kmalloc() fails
mm: slub: fix flush_cpu_slab()/__free_slab() invocations in task context.
KVM: x86: Inject #UD on emulated XSETBV if XSAVES isn't enabled
arm64: topology: fix possible overflow in amu_fie_setup()
vmlinux.lds.h: CFI: Reduce alignment of jump-table to function alignment
xfs: reorder iunlink remove operation in xfs_ifree
xfs: fix xfs_ifree() error handling to not leak perag ref
xfs: validate inode fork size against fork format
firmware: arm_scmi: Harden accesses to the reset domains
firmware: arm_scmi: Fix the asynchronous reset requests
arm64: dts: rockchip: Pull up wlan wake# on Gru-Bob
arm64: dts: rockchip: Fix typo in lisense text for PX30.Core
drm/mediatek: dsi: Add atomic {destroy,duplicate}_state, reset callbacks
arm64: dts: rockchip: Set RK3399-Gru PCLK_EDP to 24 MHz
dmaengine: ti: k3-udma-private: Fix refcount leak bug in of_xudma_dev_get()
arm64: dts: rockchip: Remove 'enable-active-low' from rk3399-puma
netfilter: nf_conntrack_sip: fix ct_sip_walk_headers
netfilter: nf_conntrack_irc: Tighten matching on DCC message
netfilter: nfnetlink_osf: fix possible bogus match in nf_osf_find()
ice: Don't double unplug aux on peer initiated reset
iavf: Fix cached head and tail value for iavf_get_tx_pending
ipvlan: Fix out-of-bound bugs caused by unset skb->mac_header
net: core: fix flow symmetric hash
net: phy: aquantia: wait for the suspend/resume operations to finish
scsi: qla2xxx: Fix memory leak in __qlt_24xx_handle_abts()
scsi: mpt3sas: Fix return value check of dma_get_required_mask()
net: bonding: Share lacpdu_mcast_addr definition
net: bonding: Unsync device addresses on ndo_stop
net: team: Unsync device addresses on ndo_stop
drm/panel: simple: Fix innolux_g121i1_l01 bus_format
MIPS: lantiq: export clk_get_io() for lantiq_wdt.ko
MIPS: Loongson32: Fix PHY-mode being left unspecified
um: fix default console kernel parameter
iavf: Fix bad page state
mlxbf_gige: clear MDIO gateway lock after read
iavf: Fix set max MTU size with port VLAN and jumbo frames
i40e: Fix VF set max MTU size
i40e: Fix set max_tx_rate when it is lower than 1 Mbps
sfc: fix TX channel offset when using legacy interrupts
sfc: fix null pointer dereference in efx_hard_start_xmit
drm/hisilicon/hibmc: Allow to be built if COMPILE_TEST is enabled
drm/hisilicon: Add depends on MMU
of: mdio: Add of_node_put() when breaking out of for_each_xx
net: ipa: properly limit modem routing table use
wireguard: ratelimiter: disable timings test by default
wireguard: netlink: avoid variable-sized memcpy on sockaddr
net: enetc: move enetc_set_psfp() out of the common enetc_set_features()
net: enetc: deny offload of tc-based TSN features on VF interfaces
net/sched: taprio: avoid disabling offload when it was never enabled
net/sched: taprio: make qdisc_leaf() see the per-netdev-queue pfifo child qdiscs
netfilter: nf_tables: fix nft_counters_enabled underflow at nf_tables_addchain()
netfilter: nf_tables: fix percpu memory leak at nf_tables_addchain()
netfilter: ebtables: fix memory leak when blob is malformed
net: ravb: Fix PHY state warning splat during system resume
net: sh_eth: Fix PHY state warning splat during system resume
can: gs_usb: gs_can_open(): fix race dev->can.state condition
perf stat: Fix BPF program section name
perf jit: Include program header in ELF files
perf kcore_copy: Do not check /proc/modules is unchanged
perf tools: Honor namespace when synthesizing build-ids
drm/mediatek: dsi: Move mtk_dsi_stop() call back to mtk_dsi_poweroff()
net/smc: Stop the CLC flow if no link to map buffers on
bonding: fix NULL deref in bond_rr_gen_slave_id
net: sunhme: Fix packet reception for len < RX_COPY_THRESHOLD
net: sched: fix possible refcount leak in tc_new_tfilter()
bnxt: prevent skb UAF after handing over to PTP worker
selftests: forwarding: add shebang for sch_red.sh
KVM: x86/mmu: Fold rmap_recycle into rmap_add
serial: fsl_lpuart: Reset prior to registration
serial: Create uart_xmit_advance()
serial: tegra: Use uart_xmit_advance(), fixes icount.tx accounting
serial: tegra-tcu: Use uart_xmit_advance(), fixes icount.tx accounting
s390/dasd: fix Oops in dasd_alias_get_start_dev due to missing pavgroup
drm/amd/amdgpu: fixing read wrong pf2vf data in SRIOV
Drivers: hv: Never allocate anything besides framebuffer from framebuffer memory region
drm/gma500: Fix BUG: sleeping function called from invalid context errors
drm/amd/pm: disable BACO entry/exit completely on several sienna cichlid cards
drm/amdgpu: use dirty framebuffer helper
drm/amd/display: Limit user regamma to a valid value
drm/amd/display: Reduce number of arguments of dml31's CalculateWatermarksAndDRAMSpeedChangeSupport()
drm/amd/display: Reduce number of arguments of dml31's CalculateFlipSchedule()
drm/amd/display: Mark dml30's UseMinimumDCFCLK() as noinline for stack usage
drm/rockchip: Fix return type of cdn_dp_connector_mode_valid
fsdax: Fix infinite loop in dax_iomap_rw()
workqueue: don't skip lockdep work dependency in cancel_work_sync()
i2c: imx: If pm_runtime_get_sync() returned 1 device access is possible
i2c: mlxbf: incorrect base address passed during io write
i2c: mlxbf: prevent stack overflow in mlxbf_i2c_smbus_start_transaction()
i2c: mlxbf: Fix frequency calculation
drm/amdgpu: don't register a dirty callback for non-atomic
NFSv4: Fixes for nfs4_inode_return_delegation()
devdax: Fix soft-reservation memory description
ext4: make directory inode spreading reflect flexbg size
ext4: fix bug in extents parsing when eh_entries == 0 and eh_depth > 0
ext4: limit the number of retries after discarding preallocations blocks
ext4: make mballoc try target group first even with mb_optimize_scan
ext4: avoid unnecessary spreading of allocations among groups
ext4: use locality group preallocation for small closed files
Linux 5.15.71
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: Ie66ba67f788b7ce6ffd766544f9ec0286bec5d9f
commit 1940265ede upstream.
mb_set_largest_free_order() updates lists containing groups with largest
chunk of free space of given order. The way it updates it leads to
always moving the group to the tail of the list. Thus allocations
looking for free space of given order effectively end up cycling through
all groups (and due to initialization in last to first order). This
spreads allocations among block groups which reduces performance for
rotating disks or low-end flash media. Change
mb_set_largest_free_order() to only update lists if the order of the
largest free chunk in the group changed.
Fixes: 196e402adf ("ext4: improve cr 0 / cr 1 group scanning")
CC: stable@kernel.org
Reported-and-tested-by: Stefan Wahren <stefan.wahren@i2se.com>
Tested-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/all/0d81a7c2-46b7-6010-62a4-3e6cfc1628d6@i2se.com/
Link: https://lore.kernel.org/r/20220908092136.11770-2-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 4fca50d440 upstream.
One of the side-effects of mb_optimize_scan was that the optimized
functions to select next group to try were called even before we tried
the goal group. As a result we no longer allocate files close to
corresponding inodes as well as we don't try to expand currently
allocated extent in the same group. This results in reaim regression
with workfile.disk workload of upto 8% with many clients on my test
machine:
baseline mb_optimize_scan
Hmean disk-1 2114.16 ( 0.00%) 2099.37 ( -0.70%)
Hmean disk-41 87794.43 ( 0.00%) 83787.47 * -4.56%*
Hmean disk-81 148170.73 ( 0.00%) 135527.05 * -8.53%*
Hmean disk-121 177506.11 ( 0.00%) 166284.93 * -6.32%*
Hmean disk-161 220951.51 ( 0.00%) 207563.39 * -6.06%*
Hmean disk-201 208722.74 ( 0.00%) 203235.59 ( -2.63%)
Hmean disk-241 222051.60 ( 0.00%) 217705.51 ( -1.96%)
Hmean disk-281 252244.17 ( 0.00%) 241132.72 * -4.41%*
Hmean disk-321 255844.84 ( 0.00%) 245412.84 * -4.08%*
Also this is causing huge regression (time increased by a factor of 5 or
so) when untarring archive with lots of small files on some eMMC storage
cards.
Fix the problem by making sure we try goal group first.
Fixes: 196e402adf ("ext4: improve cr 0 / cr 1 group scanning")
CC: stable@kernel.org
Reported-and-tested-by: Stefan Wahren <stefan.wahren@i2se.com>
Tested-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Link: https://lore.kernel.org/all/20220727105123.ckwrhbilzrxqpt24@quack3/
Link: https://lore.kernel.org/all/0d81a7c2-46b7-6010-62a4-3e6cfc1628d6@i2se.com/
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220908092136.11770-1-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 80fa46d6b9 upstream.
This patch avoids threads live-locking for hours when a large number
threads are competing over the last few free extents as they blocks
getting added and removed from preallocation pools. From our bug
reporter:
A reliable way for triggering this has multiple writers
continuously write() to files when the filesystem is full, while
small amounts of space are freed (e.g. by truncating a large file
-1MiB at a time). In the local filesystem, this can be done by
simply not checking the return code of write (0) and/or the error
(ENOSPACE) that is set. Over NFS with an async mount, even clients
with proper error checking will behave this way since the linux NFS
client implementation will not propagate the server errors [the
write syscalls immediately return success] until the file handle is
closed. This leads to a situation where NFS clients send a
continuous stream of WRITE rpcs which result in ERRNOSPACE -- but
since the client isn't seeing this, the stream of writes continues
at maximum network speed.
When some space does appear, multiple writers will all attempt to
claim it for their current write. For NFS, we may see dozens to
hundreds of threads that do this.
The real-world scenario of this is database backup tooling (in
particular, github.com/mdkent/percona-xtrabackup) which may write
large files (>1TiB) to NFS for safe keeping. Some temporary files
are written, rewound, and read back -- all before closing the file
handle (the temp file is actually unlinked, to trigger automatic
deletion on close/crash.) An application like this operating on an
async NFS mount will not see an error code until TiB have been
written/read.
The lockup was observed when running this database backup on large
filesystems (64 TiB in this case) with a high number of block
groups and no free space. Fragmentation is generally not a factor
in this filesystem (~thousands of large files, mostly contiguous
except for the parts written while the filesystem is at capacity.)
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 67feaba413 upstream.
The "hmem" platform-devices that are created to represent the
platform-advertised "Soft Reserved" memory ranges end up inserting a
resource that causes the iomem_resource tree to look like this:
340000000-43fffffff : hmem.0
340000000-43fffffff : Soft Reserved
340000000-43fffffff : dax0.0
This is because insert_resource() reparents ranges when they completely
intersect an existing range.
This matters because code that uses region_intersects() to scan for a
given IORES_DESC will only check that top-level 'hmem.0' resource and
not the 'Soft Reserved' descendant.
So, to support EINJ (via einj_error_inject()) to inject errors into
memory hosted by a dax-device, be sure to describe the memory as
IORES_DESC_SOFT_RESERVED. This is a follow-on to:
commit b13a3e5fd4 ("ACPI: APEI: Fix _EINJ vs EFI_MEMORY_SP")
...that fixed EINJ support for "Soft Reserved" ranges in the first
instance.
Fixes: 262b45ae3a ("x86/efi: EFI soft reservation to E820 enumeration")
Reported-by: Ricardo Sandoval Torres <ricardo.sandoval.torres@intel.com>
Tested-by: Ricardo Sandoval Torres <ricardo.sandoval.torres@intel.com>
Cc: <stable@vger.kernel.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Omar Avelar <omar.avelar@intel.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Mark Gross <markgross@kernel.org>
Link: https://lore.kernel.org/r/166397075670.389916.7435722208896316387.stgit@dwillia2-xfh.jf.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>