Commit Graph

1123688 Commits

Author SHA1 Message Date
Paolo Bonzini
9389d5774a Revert "KVM: nVMX: Expose load IA32_PERF_GLOBAL_CTRL VM-{Entry,Exit} control"
This reverts commit 03a8871add.

Since commit 03a8871add ("KVM: nVMX: Expose load IA32_PERF_GLOBAL_CTRL
VM-{Entry,Exit} control"), KVM has taken ownership of the "load
IA32_PERF_GLOBAL_CTRL" VMX entry/exit control bits, trying to set these
bits in the IA32_VMX_TRUE_{ENTRY,EXIT}_CTLS MSRs if the guest's CPUID
supports the architectural PMU (CPUID[EAX=0Ah].EAX[7:0]=1), and clear
otherwise.

This was a misguided attempt at mimicking what commit 5f76f6f5ff
("KVM: nVMX: Do not expose MPX VMX controls when guest MPX disabled",
2018-10-01) did for MPX.  However, that commit was a workaround for
another KVM bug and not something that should be imitated.  Mucking with
the VMX MSRs creates a subtle, difficult to maintain ABI as KVM must
ensure that any internal changes, e.g. to how KVM handles _any_ guest
CPUID changes, yield the same functional result.  Therefore, KVM's policy
is to let userspace have full control of the guest vCPU model so long
as the host kernel is not at risk.

Now that KVM really truly ensures kvm_set_msr() will succeed by loading
PERF_GLOBAL_CTRL if and only if it exists, revert KVM's misguided and
roundabout behavior.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
[sean: make it a pure revert]
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220722224409.1336532-6-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-07-28 13:24:41 -04:00
Sean Christopherson
4496a6f9b4 KVM: nVMX: Attempt to load PERF_GLOBAL_CTRL on nVMX xfer iff it exists
Attempt to load PERF_GLOBAL_CTRL during nested VM-Enter/VM-Exit if and
only if the MSR exists (according to the guest vCPU model).  KVM has very
misguided handling of VM_{ENTRY,EXIT}_LOAD_IA32_PERF_GLOBAL_CTRL and
attempts to force the nVMX MSR settings to match the vPMU model, i.e. to
hide/expose the control based on whether or not the MSR exists from the
guest's perspective.

KVM's modifications fail to handle the scenario where the vPMU is hidden
from the guest _after_ being exposed to the guest, e.g. by userspace
doing multiple KVM_SET_CPUID2 calls, which is allowed if done before any
KVM_RUN.  nested_vmx_pmu_refresh() is called if and only if there's a
recognized vPMU, i.e. KVM will leave the bits in the allow state and then
ultimately reject the MSR load and WARN.

KVM should not force the VMX MSRs in the first place.  KVM taking control
of the MSRs was a misguided attempt at mimicking what commit 5f76f6f5ff
("KVM: nVMX: Do not expose MPX VMX controls when guest MPX disabled",
2018-10-01) did for MPX.  However, the MPX commit was a workaround for
another KVM bug and not something that should be imitated (and it should
never been done in the first place).

In other words, KVM's ABI _should_ be that userspace has full control
over the MSRs, at which point triggering the WARN that loading the MSR
must not fail is trivial.

The intent of the WARN is still valid; KVM has consistency checks to
ensure that vmcs12->{guest,host}_ia32_perf_global_ctrl is valid.  The
problem is that '0' must be considered a valid value at all times, and so
the simple/obvious solution is to just not actually load the MSR when it
does not exist.  It is userspace's responsibility to provide a sane vCPU
model, i.e. KVM is well within its ABI and Intel's VMX architecture to
skip the loads if the MSR does not exist.

Fixes: 03a8871add ("KVM: nVMX: Expose load IA32_PERF_GLOBAL_CTRL VM-{Entry,Exit} control")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220722224409.1336532-5-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-07-28 13:24:22 -04:00
Sean Christopherson
b663f0b5f3 KVM: VMX: Add helper to check if the guest PMU has PERF_GLOBAL_CTRL
Add a helper to check of the guest PMU has PERF_GLOBAL_CTRL, which is
unintuitive _and_ diverges from Intel's architecturally defined behavior.
Even worse, KVM currently implements the check using two different (but
equivalent) checks, _and_ there has been at least one attempt to add a
_third_ flavor.

Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220722224409.1336532-4-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-07-28 13:23:57 -04:00
Sean Christopherson
93255bf929 KVM: VMX: Mark all PERF_GLOBAL_(OVF)_CTRL bits reserved if there's no vPMU
Mark all MSR_CORE_PERF_GLOBAL_CTRL and MSR_CORE_PERF_GLOBAL_OVF_CTRL bits
as reserved if there is no guest vPMU.  The nVMX VM-Entry consistency
checks do not check for a valid vPMU prior to consuming the masks via
kvm_valid_perf_global_ctrl(), i.e. may incorrectly allow a non-zero mask
to be loaded via VM-Enter or VM-Exit (well, attempted to be loaded, the
actual MSR load will be rejected by intel_is_valid_msr()).

Fixes: f5132b0138 ("KVM: Expose a version 2 architectural PMU to a guests")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220722224409.1336532-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-07-28 13:23:40 -04:00
Paolo Bonzini
8805875aa4 Revert "KVM: nVMX: Do not expose MPX VMX controls when guest MPX disabled"
Since commit 5f76f6f5ff ("KVM: nVMX: Do not expose MPX VMX controls
when guest MPX disabled"), KVM has taken ownership of the "load
IA32_BNDCFGS" and "clear IA32_BNDCFGS" VMX entry/exit controls,
trying to set these bits in the IA32_VMX_TRUE_{ENTRY,EXIT}_CTLS
MSRs if the guest's CPUID supports MPX, and clear otherwise.

The intent of the patch was to apply it to L0 in order to work around
L1 kernels that lack the fix in commit 691bd4340b ("kvm: vmx: allow
host to access guest MSR_IA32_BNDCFGS", 2017-07-04): by hiding the
control bits from L0, L1 hides BNDCFGS from KVM_GET_MSR_INDEX_LIST,
and the L1 bug is neutralized even in the lack of commit 691bd4340b.

This was perhaps a sensible kludge at the time, but a horrible
idea in the long term and in fact it has not been extended to
other CPUID bits like these:

  X86_FEATURE_LM => VM_EXIT_HOST_ADDR_SPACE_SIZE, VM_ENTRY_IA32E_MODE,
                    VMX_MISC_SAVE_EFER_LMA

  X86_FEATURE_TSC => CPU_BASED_RDTSC_EXITING, CPU_BASED_USE_TSC_OFFSETTING,
                     SECONDARY_EXEC_TSC_SCALING

  X86_FEATURE_INVPCID_SINGLE => SECONDARY_EXEC_ENABLE_INVPCID

  X86_FEATURE_MWAIT => CPU_BASED_MONITOR_EXITING, CPU_BASED_MWAIT_EXITING

  X86_FEATURE_INTEL_PT => SECONDARY_EXEC_PT_CONCEAL_VMX, SECONDARY_EXEC_PT_USE_GPA,
                          VM_EXIT_CLEAR_IA32_RTIT_CTL, VM_ENTRY_LOAD_IA32_RTIT_CTL

  X86_FEATURE_XSAVES => SECONDARY_EXEC_XSAVES

These days it's sort of common knowledge that any MSR in
KVM_GET_MSR_INDEX_LIST must allow *at least* setting it with KVM_SET_MSR
to a default value, so it is unlikely that something like commit
5f76f6f5ff will be needed again.  So revert it, at the potential cost
of breaking L1s with a 6 year old kernel.  While in principle the L0 owner
doesn't control what runs on L1, such an old hypervisor would probably
have many other bugs.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-07-28 13:22:28 -04:00
Sean Christopherson
f8ae08f978 KVM: nVMX: Let userspace set nVMX MSR to any _host_ supported value
Restrict the nVMX MSRs based on KVM's config, not based on the guest's
current config.  Using the guest's config to audit the new config
prevents userspace from restoring the original config (KVM's config) if
at any point in the past the guest's config was restricted in any way.

Fixes: 62cc6b9dc6 ("KVM: nVMX: support restore of VMX capability MSRs")
Cc: stable@vger.kernel.org
Cc: David Matlack <dmatlack@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220607213604.3346000-6-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-07-28 13:22:28 -04:00
Sean Christopherson
a645c2b506 KVM: nVMX: Rename handle_vm{on,off}() to handle_vmx{on,off}()
Rename the exit handlers for VMXON and VMXOFF to match the instruction
names, the terms "vmon" and "vmoff" are not used anywhere in Intel's
documentation, nor are they used elsehwere in KVM.

Sadly, the exit reasons are exposed to userspace and so cannot be renamed
without breaking userspace. :-(

Fixes: ec378aeef9 ("KVM: nVMX: Implement VMXON and VMXOFF")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220607213604.3346000-5-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-07-28 13:22:27 -04:00
Sean Christopherson
c7d855c2af KVM: nVMX: Inject #UD if VMXON is attempted with incompatible CR0/CR4
Inject a #UD if L1 attempts VMXON with a CR0 or CR4 that is disallowed
per the associated nested VMX MSRs' fixed0/1 settings.  KVM cannot rely
on hardware to perform the checks, even for the few checks that have
higher priority than VM-Exit, as (a) KVM may have forced CR0/CR4 bits in
hardware while running the guest, (b) there may incompatible CR0/CR4 bits
that have lower priority than VM-Exit, e.g. CR0.NE, and (c) userspace may
have further restricted the allowed CR0/CR4 values by manipulating the
guest's nested VMX MSRs.

Note, despite a very strong desire to throw shade at Jim, commit
70f3aac964 ("kvm: nVMX: Remove superfluous VMX instruction fault checks")
is not to blame for the buggy behavior (though the comment...).  That
commit only removed the CR0.PE, EFLAGS.VM, and COMPATIBILITY mode checks
(though it did erroneously drop the CPL check, but that has already been
remedied).  KVM may force CR0.PE=1, but will do so only when also
forcing EFLAGS.VM=1 to emulate Real Mode, i.e. hardware will still #UD.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=216033
Fixes: ec378aeef9 ("KVM: nVMX: Implement VMXON and VMXOFF")
Reported-by: Eric Li <ercli@ucdavis.edu>
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220607213604.3346000-4-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-07-28 13:22:26 -04:00
Sean Christopherson
ca58f3aa53 KVM: nVMX: Account for KVM reserved CR4 bits in consistency checks
Check that the guest (L2) and host (L1) CR4 values that would be loaded
by nested VM-Enter and VM-Exit respectively are valid with respect to
KVM's (L0 host) allowed CR4 bits.  Failure to check KVM reserved bits
would allow L1 to load an illegal CR4 (or trigger hardware VM-Fail or
failed VM-Entry) by massaging guest CPUID to allow features that are not
supported by KVM.  Amusingly, KVM itself is an accomplice in its doom, as
KVM adjusts L1's MSR_IA32_VMX_CR4_FIXED1 to allow L1 to enable bits for
L2 based on L1's CPUID model.

Note, although nested_{guest,host}_cr4_valid() are _currently_ used if
and only if the vCPU is post-VMXON (nested.vmxon == true), that may not
be true in the future, e.g. emulating VMXON has a bug where it doesn't
check the allowed/required CR0/CR4 bits.

Cc: stable@vger.kernel.org
Fixes: 3899152ccb ("KVM: nVMX: fix checks on CR{0,4} during virtual VMX operation")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220607213604.3346000-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-07-28 13:22:26 -04:00
Sean Christopherson
c33f6f2228 KVM: x86: Split kvm_is_valid_cr4() and export only the non-vendor bits
Split the common x86 parts of kvm_is_valid_cr4(), i.e. the reserved bits
checks, into a separate helper, __kvm_is_valid_cr4(), and export only the
inner helper to vendor code in order to prevent nested VMX from calling
back into vmx_is_valid_cr4() via kvm_is_valid_cr4().

On SVM, this is a nop as SVM doesn't place any additional restrictions on
CR4.

On VMX, this is also currently a nop, but only because nested VMX is
missing checks on reserved CR4 bits for nested VM-Enter.  That bug will
be fixed in a future patch, and could simply use kvm_is_valid_cr4() as-is,
but nVMX has _another_ bug where VMXON emulation doesn't enforce VMX's
restrictions on CR0/CR4.  The cleanest and most intuitive way to fix the
VMXON bug is to use nested_host_cr{0,4}_valid().  If the CR4 variant
routes through kvm_is_valid_cr4(), using nested_host_cr4_valid() won't do
the right thing for the VMXON case as vmx_is_valid_cr4() enforces VMX's
restrictions if and only if the vCPU is post-VMXON.

Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220607213604.3346000-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-07-28 13:22:25 -04:00
Sean Christopherson
cfe12e64b0 KVM: selftests: Add an option to run vCPUs while disabling dirty logging
Add a command line option to dirty_log_perf_test to run vCPUs for the
entire duration of disabling dirty logging.  By default, the test stops
running runs vCPUs before disabling dirty logging, which is faster but
less interesting as it doesn't stress KVM's handling of contention
between page faults and the zapping of collapsible SPTEs.  Enabling the
flag also lets the user verify that KVM is indeed rebuilding zapped SPTEs
as huge pages by checking KVM's pages_{1g,2m,4k} stats.  Without vCPUs to
fault in the zapped SPTEs, the stats will show that KVM is zapping pages,
but they never show whether or not KVM actually allows huge pages to be
recreated.

Note!  Enabling the flag can _significantly_ increase runtime, especially
if the thread that's disabling dirty logging doesn't have a dedicated
pCPU, e.g. if all pCPUs are used to run vCPUs.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220715232107.3775620-5-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-07-28 13:22:24 -04:00
Sean Christopherson
85f44f8cc0 KVM: x86/mmu: Don't bottom out on leafs when zapping collapsible SPTEs
When zapping collapsible SPTEs in the TDP MMU, don't bottom out on a leaf
SPTE now that KVM doesn't require a PFN to compute the host mapping level,
i.e. now that there's no need to first find a leaf SPTE and then step
back up.

Drop the now unused tdp_iter_step_up(), as it is not the safest of
helpers (using any of the low level iterators requires some understanding
of the various side effects).

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220715232107.3775620-4-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-07-28 13:22:24 -04:00
Sean Christopherson
65e3b446bc KVM: x86/mmu: Document the "rules" for using host_pfn_mapping_level()
Add a comment to document how host_pfn_mapping_level() can be used safely,
as the line between safe and dangerous is quite thin.  E.g. if KVM were
to ever support in-place promotion to create huge pages, consuming the
level is safe if the caller holds mmu_lock and checks that there's an
existing _leaf_ SPTE, but unsafe if the caller only checks that there's a
non-leaf SPTE.

Opportunistically tweak the existing comments to explicitly document why
KVM needs to use READ_ONCE().

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220715232107.3775620-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-07-28 13:22:23 -04:00
Sean Christopherson
a8ac499bb6 KVM: x86/mmu: Don't require refcounted "struct page" to create huge SPTEs
Drop the requirement that a pfn be backed by a refcounted, compound or
or ZONE_DEVICE, struct page, and instead rely solely on the host page
tables to identify huge pages.  The PageCompound() check is a remnant of
an old implementation that identified (well, attempt to identify) huge
pages without walking the host page tables.  The ZONE_DEVICE check was
added as an exception to the PageCompound() requirement.  In other words,
neither check is actually a hard requirement, if the primary has a pfn
backed with a huge page, then KVM can back the pfn with a huge page
regardless of the backing store.

Dropping the @pfn parameter will also allow KVM to query the max host
mapping level without having to first get the pfn, which is advantageous
for use outside of the page fault path where KVM wants to take action if
and only if a page can be mapped huge, i.e. avoids the pfn lookup for
gfns that can't be backed with a huge page.

Cc: Mingwei Zhang <mizhang@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Mingwei Zhang <mizhang@google.com>
Message-Id: <20220715232107.3775620-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-07-28 13:22:22 -04:00
Sean Christopherson
d5e90a6998 KVM: x86/mmu: Restrict mapping level based on guest MTRR iff they're used
Restrict the mapping level for SPTEs based on the guest MTRRs if and only
if KVM may actually use the guest MTRRs to compute the "real" memtype.
For all forms of paging, guest MTRRs are purely virtual in the sense that
they are completely ignored by hardware, i.e. they affect the memtype
only if software manually consumes them.  The only scenario where KVM
consumes the guest MTRRs is when shadow_memtype_mask is non-zero and the
guest has non-coherent DMA, in all other cases KVM simply leaves the PAT
field in SPTEs as '0' to encode WB memtype.

Note, KVM may still ultimately ignore guest MTRRs, e.g. if the backing
pfn is host MMIO, but false positives are ok as they only cause a slight
performance blip (unless the guest is doing weird things with its MTRRs,
which is extremely unlikely).

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20220715230016.3762909-5-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-07-28 13:22:22 -04:00
Sean Christopherson
38bf9d7bf2 KVM: x86/mmu: Add shadow mask for effective host MTRR memtype
Add shadow_memtype_mask to capture that EPT needs a non-zero memtype mask
instead of relying on TDP being enabled, as NPT doesn't need a non-zero
mask.  This is a glorified nop as kvm_x86_ops.get_mt_mask() returns zero
for NPT anyways.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20220715230016.3762909-4-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-07-28 13:22:21 -04:00
Sean Christopherson
82ffad2ddf KVM: x86: Drop unnecessary goto+label in kvm_arch_init()
Return directly if kvm_arch_init() detects an error before doing any real
work, jumping through a label obfuscates what's happening and carries the
unnecessary risk of leaving 'r' uninitialized.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20220715230016.3762909-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-07-28 13:22:20 -04:00
Sean Christopherson
94bda2f4cd KVM: x86: Reject loading KVM if host.PAT[0] != WB
Reject KVM if entry '0' in the host's IA32_PAT MSR is not programmed to
writeback (WB) memtype.  KVM subtly relies on IA32_PAT entry '0' to be
programmed to WB by leaving the PAT bits in shadow paging and NPT SPTEs
as '0'.  If something other than WB is in PAT[0], at _best_ guests will
suffer very poor performance, and at worst KVM will crash the system by
breaking cache-coherency expecations (e.g. using WC for guest memory).

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20220715230016.3762909-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-07-28 13:22:20 -04:00
Suravee Suthikulpanit
01e69cef63 KVM: SVM: Fix x2APIC MSRs interception
The index for svm_direct_access_msrs was incorrectly initialized with
the APIC MMIO register macros. Fix by introducing a macro for calculating
x2APIC MSRs.

Fixes: 5c127c8547 ("KVM: SVM: Adding support for configuring x2APIC MSRs interception")
Cc: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Message-Id: <20220718083833.222117-1-suravee.suthikulpanit@amd.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-07-28 13:22:19 -04:00
Sean Christopherson
3c2e10373e KVM: x86/mmu: Remove underscores from __pte_list_remove()
Remove the underscores from __pte_list_remove(), the function formerly
known as pte_list_remove() is now named kvm_zap_one_rmap_spte() to show
that it zaps rmaps/PTEs, i.e. doesn't just remove an entry from a list.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220715224226.3749507-8-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-07-28 13:22:18 -04:00
Sean Christopherson
9202aee816 KVM: x86/mmu: Rename pte_list_{destroy,remove}() to show they zap SPTEs
Rename pte_list_remove() and pte_list_destroy() to kvm_zap_one_rmap_spte()
and kvm_zap_all_rmap_sptes() respectively to document that (a) they zap
SPTEs and (b) to better document how they differ (remove vs. destroy does
not exactly scream "one vs. all").

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220715224226.3749507-7-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-07-28 13:22:18 -04:00
Sean Christopherson
f8480721a7 KVM: x86/mmu: Rename rmap zap helpers to eliminate "unmap" wrapper
Rename kvm_unmap_rmap() and kvm_zap_rmap() to kvm_zap_rmap() and
__kvm_zap_rmap() respectively to show that what was the "unmap" helper is
just a wrapper for the "zap" helper, i.e. that they do the exact same
thing, one just exists to deal with its caller passing in more params.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220715224226.3749507-6-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-07-28 13:22:17 -04:00
Sean Christopherson
2833eda0e2 KVM: x86/mmu: Rename __kvm_zap_rmaps() to align with other nomenclature
Rename __kvm_zap_rmaps() to kvm_rmap_zap_gfn_range() to avoid future
confusion with a soon-to-be-introduced __kvm_zap_rmap().  Using a plural
"rmaps" is somewhat ambiguous without additional context, as it's not
obvious whether it's referring to multiple rmap lists, versus multiple
rmap entries within a single list.

Use kvm_rmap_zap_gfn_range() to align with the pattern established by
kvm_rmap_zap_collapsible_sptes(), without losing the information that it
zaps only rmap-based MMUs, i.e. don't rename it to __kvm_zap_gfn_range().

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220715224226.3749507-5-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-07-28 13:22:16 -04:00
Sean Christopherson
aed02fe3ca KVM: x86/mmu: Drop the "p is for pointer" from rmap helpers
Drop the trailing "p" from rmap helpers, i.e. rename functions to simply
be kvm_<action>_rmap().  Declaring that a function takes a pointer is
completely unnecessary and goes against kernel style.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220715224226.3749507-4-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-07-28 13:22:16 -04:00
Sean Christopherson
a42989e7fb KVM: x86/mmu: Directly "destroy" PTE list when recycling rmaps
Use pte_list_destroy() directly when recycling rmaps instead of bouncing
through kvm_unmap_rmapp() and kvm_zap_rmapp().  Calling kvm_unmap_rmapp()
is unnecessary and odd as it requires passing dummy parameters; passing
NULL for @slot when __rmap_add() already has a valid slot is especially
weird and confusing.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220715224226.3749507-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-07-28 13:22:15 -04:00
Sean Christopherson
35d539c3e4 KVM: x86/mmu: Return a u64 (the old SPTE) from mmu_spte_clear_track_bits()
Return a u64, not an int, from mmu_spte_clear_track_bits().  The return
value is the old SPTE value, which is very much a 64-bit value.  The sole
caller that consumes the return value, drop_spte(), already uses a u64.
The only reason that truncating the SPTE value is not problematic is
because drop_spte() only queries the shadow-present bit, which is in the
lower 32 bits.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220715224226.3749507-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-07-28 13:22:15 -04:00
Maciej S. Szmigiero
da0b93d65e KVM: nSVM: Pull CS.Base from actual VMCB12 for soft int/ex re-injection
enter_svm_guest_mode() first calls nested_vmcb02_prepare_control() to copy
control fields from VMCB12 to the current VMCB, then
nested_vmcb02_prepare_save() to perform a similar copy of the save area.

This means that nested_vmcb02_prepare_control() still runs with the
previous save area values in the current VMCB so it shouldn't take the L2
guest CS.Base from this area.

Explicitly pull CS.Base from the actual VMCB12 instead in
enter_svm_guest_mode().

Granted, having a non-zero CS.Base is a very rare thing (and even
impossible in 64-bit mode), having it change between nested VMRUNs is
probably even rarer, but if it happens it would create a really subtle bug
so it's better to fix it upfront.

Fixes: 6ef88d6e36 ("KVM: SVM: Re-inject INT3/INTO instead of retrying the instruction")
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <4caa0f67589ae3c22c311ee0e6139496902f2edc.1658159083.git.maciej.szmigiero@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-07-28 13:22:14 -04:00
Linus Torvalds
e64ab2dbd8 watch_queue: Fix missing locking in add_watch_to_object()
If a watch is being added to a queue, it needs to guard against
interference from addition of a new watch, manual removal of a watch and
removal of a watch due to some other queue being destroyed.

KEYCTL_WATCH_KEY guards against this for the same {key,queue} pair by
holding the key->sem writelocked and by holding refs on both the key and
the queue - but that doesn't prevent interaction from other {key,queue}
pairs.

While add_watch_to_object() does take the spinlock on the event queue,
it doesn't take the lock on the source's watch list.  The assumption was
that the caller would prevent that (say by taking key->sem) - but that
doesn't prevent interference from the destruction of another queue.

Fix this by locking the watcher list in add_watch_to_object().

Fixes: c73be61ced ("pipe: Add general notification queue support")
Reported-by: syzbot+03d7b43290037d1f87ca@syzkaller.appspotmail.com
Signed-off-by: David Howells <dhowells@redhat.com>
cc: keyrings@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-07-28 10:06:49 -07:00
David Howells
e0339f036e watch_queue: Fix missing rcu annotation
Since __post_watch_notification() walks wlist->watchers with only the
RCU read lock held, we need to use RCU methods to add to the list (we
already use RCU methods to remove from the list).

Fix add_watch_to_object() to use hlist_add_head_rcu() instead of
hlist_add_head() for that list.

Fixes: c73be61ced ("pipe: Add general notification queue support")
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-07-28 10:06:49 -07:00
Sumanth Korikkar
ded466e180 s390/unwind: fix fgraph return address recovery
When HAVE_FUNCTION_GRAPH_RET_ADDR_PTR is defined, the return
address to the fgraph caller is recovered by tagging it along with the
stack pointer of ftrace stack. This makes the stack unwinding more
reliable.

When the fgraph return address is modified to return_to_handler,
ftrace_graph_ret_addr tries to restore it to the original
value using tagged stack pointer.

Fix this by passing tagged sp to ftrace_graph_ret_addr.

Fixes: d81675b60d ("s390/unwind: recover kretprobe modified return address in stacktrace")
Cc: <stable@vger.kernel.org> # 5.18
Reviewed-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2022-07-28 18:05:24 +02:00
Sven Schnelle
520763a327 s390/nmi: use irqentry_nmi_enter()/irqentry_nmi_exit()
With generic entry in place switch the nmi handler to use
the generic entry helper functions.

Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2022-07-28 18:05:24 +02:00
Janosch Frank
a0c0c44e9a s390: add ELF note type for encrypted CPU state of a PV VCPU
The type NT_S390_PV_CPU_DATA note contains the encrypted CPU state of
a PV VCPU. It's only relevant in dumps of s390 PV VMs and can't be
decrypted without a second block of encrypted data which provides key
parts. Therefore we only reserve the note type here.

The zgetdump tool from the s390-tools package can, together with a
Customer Communication Key, be used to convert a PV VM dump into a
normal VM dump. zgetdump will decrypt the CPU data and overwrite the
other respective notes to make the data accessible for crash and other
debugging tools.

Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
[agordeev@linux.ibm.com changed desctiption]
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2022-07-28 18:05:24 +02:00
Alexander Gordeev
e409b7f191 s390/smp,ptdump: add absolute lowcore markers
Add "Lowcore Area Start" and "Lowcore Area End" markers
that fence pages where absolute lowcore resides.

Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2022-07-28 18:05:24 +02:00
Alexander Gordeev
7d06fed77b s390/smp: rework absolute lowcore access
Temporary unsetting of the prefix page in memcpy_absolute() routine
poses a risk of executing code path with unexpectedly disabled prefix
page. This rework avoids the prefix page uninstalling and disabling
of normal and machine check interrupts when accessing the absolute
zero memory.

Although memcpy_absolute() routine can access the whole memory, it is
only used to update the absolute zero lowcore. This rework therefore
introduces a new mechanism for the absolute zero lowcore access and
scraps memcpy_absolute() routine for good.

Instead, an area is reserved in the virtual memory that is used for
the absolute lowcore access only. That area holds an array of 8KB
virtual mappings - one per CPU. Whenever a CPU is brought online, the
corresponding item is mapped to the real address of the previously
installed prefix page.

The absolute zero lowcore access works like this: a CPU calls the
new primitive get_abs_lowcore() to obtain its 8KB mapping as a
pointer to the struct lowcore. Virtual address references to that
pointer get translated to the real addresses of the prefix page,
which in turn gets swapped with the absolute zero memory addresses
due to prefixing. Once the pointer is not needed it must be released
with put_abs_lowcore() primitive:

	struct lowcore *abs_lc;
	unsigned long flags;

	abs_lc = get_abs_lowcore(&flags);
	abs_lc->... = ...;
	put_abs_lowcore(abs_lc, flags);

To ensure the described mechanism works large segment- and region-
table entries must be avoided for the 8KB mappings. Failure to do
so results in usage of Region-Frame Absolute Address (RFAA) or
Segment-Frame Absolute Address (SFAA) large page fields. In that
case absolute addresses would be used to address the prefix page
instead of the real ones and the prefixing would get bypassed.

Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2022-07-28 18:05:23 +02:00
Alexander Gordeev
2e2493c675 s390/setup: rearrange absolute lowcore initialization
Make the absolute lowcore assignments immediately follow
the boot CPU lowcore same member assignments. This way
readability improves when reading from up to down, with
no out of order mcck stack allocation in-between.

Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2022-07-28 18:05:23 +02:00
Alexander Gordeev
57ad19bcde s390/boot: cleanup adjust_to_uv_max() function
Uncouple input and output arguments by making the latter
the function return value.

Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2022-07-28 18:05:23 +02:00
Alexander Gordeev
6f5c672d17 s390/smp: enforce lowcore protection on CPU restart
As result of commit 915fea04f9 ("s390/smp: enable DAT before
CPU restart callback is called") the low-address protection bit
gets mistakenly unset in control register 0 save area of the
absolute zero memory. That area is used when manual PSW restart
happened to hit an offline CPU. In this case the low-address
protection for that CPU will be dropped.

Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Fixes: 915fea04f9 ("s390/smp: enable DAT before CPU restart callback is called")
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2022-07-28 18:05:23 +02:00
Jason Wang
fc7fab3f91 s390/tape: fix comment typo
Remove duplicated `that' in a comment

Signed-off-by: Jason Wang <wangborong@cdjrlc.com>
Link: https://lore.kernel.org/r/20220715053838.5005-1-wangborong@cdjrlc.com
[agordeev@linux.ibm.com rephrased commit message]
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2022-07-28 18:05:23 +02:00
Randy Dunlap
57c3ae8e44 s390/hmcdrv: fix Kconfig "its" grammar
Use the possessive "its" instead of the contraction "it's"
where appropriate.

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: linux-s390@vger.kernel.org
Link: https://lore.kernel.org/r/20220715020010.12678-1-rdunlap@infradead.org
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2022-07-28 18:05:22 +02:00
Alexander Gordeev
2f089a3846 Merge branch 'vmcore-iov_iter' into features
Pull changes that finalize switching of copy_oldmem_page() callback
to iov_iter interface. These changes were pulled in work.iov_iter of
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs.git

Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2022-07-28 17:53:11 +02:00
wangjianli
dd390cba54 IB/qib: Fix repeated "in" within comments
Delete the redundant word 'in'.

Link: https://lore.kernel.org/r/20220724074407.18552-1-wangjianli@cdjrlc.com
Signed-off-by: wangjianli <wangjianli@cdjrlc.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2022-07-28 12:50:05 -03:00
João Paulo Rechi Vita
339170d8d3 docs: efi-stub: Fix paths for x86 / arm stubs
This fixes the paths of x86 / arm efi-stub source files.

Signed-off-by: João Paulo Rechi Vita <jprvita@endlessos.org>
Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
Link: https://lore.kernel.org/r/20220727140539.10021-1-jprvita@endlessos.org
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
2022-07-28 09:41:56 -06:00
Yanteng Si
4116ff7974 Docs/zh_CN: Update the translation of sched-stats to 5.19-rc8
Update to commit 6c757e9f55 ("docs/scheduler:
fix unit error")

ddb21d27a6 ("docs/scheduler: Change unit of
cpu_time and rq_time to nanoseconds")

Signed-off-by: Yanteng Si <siyanteng@loongson.cn>
Reviewed-by: Wu XiangCheng <bobwxc@email.cn>
Reviewed-by: Alex Shi <alexs@kernel.org>
Link: https://lore.kernel.org/r/3cb1c4c466dfa38d72a867dc6e2c833ceb69ecb7.1658983157.git.siyanteng@loongson.cn
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
2022-07-28 09:37:25 -06:00
Yanteng Si
ce1120076c Docs/zh_CN: Update the translation of pci to 5.19-rc8
Update to commit f21949c149 ("PCI/doc:Update
obsolete pci_set_dma_mask() references")

05b0ebd06a ("PCI/doc: cleanup references to
the legacy PCI DMA API")

Signed-off-by: Yanteng Si <siyanteng@loongson.cn>
Reviewed-by: Wu XiangCheng <bobwxc@email.cn>
Reviewed-by: Alex Shi <alexs@kernel.org>
Link: https://lore.kernel.org/r/bccb98a1c6706ecfd4aaba8c27f1c64024e1c139.1658983157.git.siyanteng@loongson.cn
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
2022-07-28 09:37:25 -06:00
Yanteng Si
c78478e164 Docs/zh_CN: Update the translation of pci-iov-howto to 5.19-rc8
Update to commit 4f23bd5d09 ("PCI/doc: Convert
examples to generic power management")

Signed-off-by: Yanteng Si <siyanteng@loongson.cn>
Reviewed-by: Alex Shi <alexs@kernel.org>
Link: https://lore.kernel.org/r/24c53aabc9d942f6fe38b8d3a843329edca3d18c.1658983157.git.siyanteng@loongson.cn
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
2022-07-28 09:37:25 -06:00
Yanteng Si
83b41bb27b Docs/zh_CN: Update the translation of usage to 5.19-rc8
Update to commit b57e39a743 ("Docs/admin-guide
/damon/sysfs: document 'LRU_DEPRIO' scheme action")

0bcba960b1 ("Docs/admin-guide/damon/sysfs: document 'LRU_PRIO' scheme action")

Signed-off-by: Yanteng Si <siyanteng@loongson.cn>
Reviewed-by: Alex Shi <alexs@kernel.org>
Link: https://lore.kernel.org/r/130795ae1a4097e25e8e354870e65c3018241a8e.1658983157.git.siyanteng@loongson.cn
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
2022-07-28 09:37:25 -06:00
Yanteng Si
63c1d2516b Docs/zh_CN: Update the translation of testing-overview to 5.19-rc8
Update to commit 12379401c0 ("Documentation: dev-tools:
Add a section for static analysis tools")

Signed-off-by: Yanteng Si <siyanteng@loongson.cn>
Reviewed-by: Alex Shi <alexs@kernel.org>
Link: https://lore.kernel.org/r/b9911916446f858fe95e4d4bfeaecdab14bc0b6a.1658983157.git.siyanteng@loongson.cn
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
2022-07-28 09:37:25 -06:00
Yanteng Si
6a5057e9dc Docs/zh_CN: Update the translation of sparse to 5.19-rc8
Update to commit179fd6ba3bac ("Documentation/sparse:
 add hints about __CHECKER__")

Signed-off-by: Yanteng Si <siyanteng@loongson.cn>
Reviewed-by: Alex Shi <alexs@kernel.org>
Link: https://lore.kernel.org/r/2ecdf7ddc8644e9031e4e2af947b97d63ab046f0.1658983157.git.siyanteng@loongson.cn
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
2022-07-28 09:37:25 -06:00
Yanteng Si
507f48799a Docs/zh_CN: Update the translation of kasan to 5.19-rc8
update to commit 3ff16d30f5 ("kasan: test:
improve failure message in KUNIT_EXPECT_KASAN_FAIL()")

c9d1af2b78 ("mm/kasan: move kasan.fault to mm/kasan/report.c")
2d27e58514 ("kasan: Extend KASAN mode kernel parameter")
8479d7b5be ("kasan: documentation updates")
c2ec0c8f68 ("kasan: update documentation")

Signed-off-by: Yanteng Si <siyanteng@loongson.cn>
Reviewed-by: Alex Shi <alexs@kernel.org>
Link: https://lore.kernel.org/r/abbb6c5cc5a7daf0720a9cb0a8255472d4ac69b4.1658983157.git.siyanteng@loongson.cn
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
2022-07-28 09:37:25 -06:00
Yanteng Si
659797dc4d Docs/zh_CN: Update the translation of iio_configfs to 5.19-rc8
update to commit dafcf4ed83 ("iio: hrtimer: Allow
sub Hz granularity").

c1d82dbcb0 ("docs: iio: fix example formatting").

Remove some useless spaces.

Signed-off-by: Yanteng Si <siyanteng@loongson.cn>
Reviewed-by: Alex Shi <alexs@kernel.org>
Link: https://lore.kernel.org/r/1c0ae3e0932d813d41de666c1c65379786e79ba6.1658983157.git.siyanteng@loongson.cn
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
2022-07-28 09:37:25 -06:00