oom-reaper and process_mrelease system call should protect against
races with exit_mmap which can destroy page tables while they
walk the VMA tree. oom-reaper protects from that race by setting
MMF_OOM_VICTIM and by relying on exit_mmap to set MMF_OOM_SKIP
before taking and releasing mmap_write_lock. process_mrelease has
to elevate mm->mm_users to prevent such race. Both oom-reaper and
process_mrelease hold mmap_read_lock when walking the VMA tree.
The locking rules and mechanisms could be simpler if exit_mmap takes
mmap_write_lock while executing destructive operations such as
free_pgtables.
Change exit_mmap to hold the mmap_write_lock when calling
free_pgtables. Operations like unmap_vmas() and unlock_range() are not
destructive and could run under mmap_read_lock but for simplicity we
take one mmap_write_lock during almost the entire operation. Note
also that because oom-reaper checks VM_LOCKED flag, unlock_range()
should not be allowed to race with it.
In most cases this lock should be uncontended. Previously, Kirill
reported ~4% regression caused by a similar change [1]. We reran the
same test and although the individual results are quite noisy, the
percentiles show lower regression with 1.6% being the worst case [2].
The change allows oom-reaper and process_mrelease to execute safely
under mmap_read_lock without worries that exit_mmap might destroy page
tables from under them.
[1] https://lore.kernel.org/all/20170725141723.ivukwhddk2voyhuc@node.shutemov.name/
[2] https://lore.kernel.org/all/CAJuCfpGC9-c9P40x7oy=jy5SphMcd0o0G_6U1-+JAziGKG6dGA@mail.gmail.com/
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Link: https://lore.kernel.org/all/20211124235906.14437-1-surenb@google.com/
Bug: 130172058
Bug: 189803002
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: Ic87272d09a0b68a1b0e968e8f1a1510fd6fc776a
Make use of the newly introduced unshare hypercall during guest teardown
to unmap guest-related data structures from the hyp stage-1.
Signed-off-by: Quentin Perret <qperret@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20211215161232.1480836-15-qperret@google.com
(cherry picked from commit 52b28657eb
git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm.git next)
Bug: 209599700
Change-Id: Ic8ec2c74d3e803e4d5f0df37d144d2c2eb0e9ad3
Introduce an unshare hypercall which can be used to unmap memory from
the hypervisor stage-1 in nVHE protected mode. This will be useful to
update the EL2 ownership state of pages during guest teardown, and
avoids keeping dangling mappings to unreferenced portions of memory.
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Quentin Perret <qperret@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20211215161232.1480836-14-qperret@google.com
(cherry picked from commit b8cc6eb5bd
git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm.git next)
Bug: 209599700
Change-Id: I0b2839046914a0fd9a6d0077383d53dc84a8ac3d
Tearing down a previously shared memory region results in the borrower
losing access to the underlying pages and returning them to the "owned"
state in the owner.
Implement a do_unshare() helper, along the same lines as do_share(), to
provide this functionality for the host-to-hyp case.
Reviewed-by: Andrew Walbran <qwandor@google.com>
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Quentin Perret <qperret@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20211215161232.1480836-13-qperret@google.com
(cherry picked from commit 376a240f03
git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm.git next)
Bug: 209599700
Change-Id: I07aa6194099325c955f48ab608c4b601664419e5
__pkvm_host_share_hyp() shares memory between the host and the
hypervisor so implement it as an invocation of the new do_share()
mechanism.
Note that double-sharing is no longer permitted (as this allows us to
reduce the number of page-table walks significantly), but is thankfully
no longer relied upon by the host.
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Quentin Perret <qperret@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20211215161232.1480836-12-qperret@google.com
(cherry picked from commit 1ee32109fd
git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm.git next)
Bug: 209599700
Change-Id: I9f5877b6ea7b9bf735a90492fbca16f485e73110
By default, protected KVM isolates memory pages so that they are
accessible only to their owner: be it the host kernel, the hypervisor
at EL2 or (in future) the guest. Establishing shared-memory regions
between these components therefore involves a transition for each page
so that the owner can share memory with a borrower under a certain set
of permissions.
Introduce a do_share() helper for safely sharing a memory region between
two components. Currently, only host-to-hyp sharing is implemented, but
the code is easily extended to handle other combinations and the
permission checks for each component are reusable.
Reviewed-by: Andrew Walbran <qwandor@google.com>
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Quentin Perret <qperret@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20211215161232.1480836-11-qperret@google.com
(cherry picked from commit e82edcc75c
git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm.git next)
Bug: 209599700
Change-Id: I3ee4486ab08faa24015ac608979e3596b27806a2
In preparation for adding additional locked sections for manipulating
page-tables at EL2, introduce some simple wrappers around the host and
hypervisor locks so that it's a bit easier to read and bit more difficult
to take the wrong lock (or even take them in the wrong order).
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Quentin Perret <qperret@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20211215161232.1480836-10-qperret@google.com
(cherry picked from commit 61d99e33e7
git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm.git next)
Bug: 209599700
Change-Id: I413e2d7f514db507a1aca9a21f04115e97f72c00
Explicitly name the combination of SW0 | SW1 as reserved in the pte and
introduce a new PKVM_NOPAGE meta-state which, although not directly
stored in the software bits of the pte, can be used to represent an
entry for which there is no underlying page. This is distinct from an
invalid pte, as stage-2 identity mappings for the host are created
lazily and so an invalid pte there is the same as a valid mapping for
the purposes of ownership information.
This state will be used for permission checking during page transitions
in later patches.
Reviewed-by: Andrew Walbran <qwandor@google.com>
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Quentin Perret <qperret@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20211215161232.1480836-9-qperret@google.com
(cherry picked from commit 3d467f7b8c
git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm.git next)
Bug: 209599700
Change-Id: I2e803a6ae83694be178a3d3eaf75fe918c881659
In order to simplify the page tracking infrastructure at EL2 in nVHE
protected mode, move the responsibility of refcounting pages that are
shared multiple times on the host. In order to do so, let's create a
red-black tree tracking all the PFNs that have been shared, along with
a refcount.
Acked-by: Will Deacon <will@kernel.org>
Signed-off-by: Quentin Perret <qperret@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20211215161232.1480836-8-qperret@google.com
(cherry picked from commit a83e2191b7
git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm.git next)
Bug: 209599700
Change-Id: I58f39d139c84acb545676c99144547cd080cdda3
The create_hyp_mappings() function can currently be called at any point
in time. However, its behaviour in protected mode changes widely
depending on when it is being called. Prior to KVM init, it is used to
create the temporary page-table used to bring-up the hypervisor, and
later on it is transparently turned into a 'share' hypercall when the
kernel has lost control over the hypervisor stage-1. In order to prepare
the ground for also unsharing pages with the hypervisor during guest
teardown, introduce a kvm_share_hyp() function to make it clear in which
places a share hypercall should be expected, as we will soon need a
matching unshare hypercall in all those places.
[ qperret: BACKPORT due to resolution in kvm_arch_vcpu_run_map_fp() in
merge commit 2d761dbf7f ("Merge branch kvm-arm64/fpsimd-tracking
into kvmarm-master/next") ]
Signed-off-by: Quentin Perret <qperret@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20211215161232.1480836-7-qperret@google.com
(cherry picked from commit 3f868e142c
git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm.git next)
Bug: 209599700
Change-Id: Id3ba033a51550e5aada862d6af06773d0be833fb
kvm_pgtable_hyp_unmap() relies on the ->page_count() function callback
being provided by the memory-management operations for the page-table.
Wire up this callback for the hypervisor stage-1 page-table.
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Quentin Perret <qperret@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20211215161232.1480836-5-qperret@google.com
(cherry picked from commit 34ec7cbf1e
git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm.git next)
Bug: 209599700
Change-Id: I481b1f722f4dddda72e35481aca725a937a52090
In nVHE-protected mode, the hyp stage-1 page-table refcount is broken
due to the lack of refcount support in the early allocator. Fix-up the
refcount in the finalize walker, once the 'hyp_vmemmap' is up and running.
Acked-by: Will Deacon <will@kernel.org>
Signed-off-by: Quentin Perret <qperret@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20211215161232.1480836-4-qperret@google.com
(cherry picked from commit d6b4bd3f48
git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm.git next)
Bug: 209599700
Change-Id: I6c48e4e5a0ae969e1d1ac6be0ce601faf138cfd3
To prepare the ground for allowing hyp stage-1 mappings to be removed at
run-time, update the KVM page-table code to maintain a correct refcount
using the ->{get,put}_page() function callbacks.
Signed-off-by: Quentin Perret <qperret@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20211215161232.1480836-3-qperret@google.com
(cherry picked from commit 2ea2ff91e8
git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm.git next)
Bug: 209599700
Change-Id: Ic4ce9177a4381a01e1f2c3b00e9f376e14015718
In nVHE protected mode, the EL2 code uses a temporary allocator during
boot while re-creating its stage-1 page-table. Unfortunately, the
hyp_vmmemap is not ready to use at this stage, so refcounting pages
is not possible. That is not currently a problem because hyp stage-1
mappings are never removed, which implies refcounting of page-table
pages is unnecessary.
In preparation for allowing hypervisor stage-1 mappings to be removed,
provide stub implementations for {get,put}_page() in the early allocator.
Acked-by: Will Deacon <will@kernel.org>
Signed-off-by: Quentin Perret <qperret@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20211215161232.1480836-2-qperret@google.com
(cherry picked from commit 1fac3cfb9c
git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm.git next)
Bug: 209599700
Change-Id: I9602cb4db9cef6f52e8beefa52e53cf71f2e0cd1
This reverts commit 2183d4b524.
This will be replaced by a FROMGIT patch shortly.
Signed-off-by: Quentin Perret <qperret@google.com>
Change-Id: I1c7256d8992d29d38f122f5ad889ba904eda9793
This reverts commit b91c0e5d96.
This will be replaced by a FROMGIT patch shortly.
Signed-off-by: Quentin Perret <qperret@google.com>
Change-Id: I34ad2c837d3d37cfe85b6a6e0fdba1368298fce6
This reverts commit 59b96bd25e.
This will be replaced by a FROMGIT patch shortly.
Signed-off-by: Quentin Perret <qperret@google.com>
Change-Id: Icde87ddf560b90c55195f3c1250e6079b0795389
This reverts commit 75ed01242c.
This will be replaced by a FROMGIT patch shortly.
Signed-off-by: Quentin Perret <qperret@google.com>
Change-Id: If7628be6ada702db6aa1326734fd70349aa1f3e4
This reverts commit d6d544e4af.
This will be replaced by a FROMGIT patch shortly.
Signed-off-by: Quentin Perret <qperret@google.com>
Change-Id: Ie7351e6764d468a5b8a5c0ed663e50d8875914b3
This reverts commit 7c979b8271.
This will be replaced by a FROMGIT patch shortly.
Signed-off-by: Quentin Perret <qperret@google.com>
Change-Id: I710c31254ce80a5b1de5ced791ccaec1b0bebd64
This reverts commit e41cc0f83a.
This will be replaced by a FROMGIT patch shortly.
Signed-off-by: Quentin Perret <qperret@google.com>
Change-Id: Id418beae1a0d2d418c9e5167ae25105f231bdd68
This reverts commit 1d5d8f2493.
This will be replaced by a FROMGIT patch shortly.
Signed-off-by: Quentin Perret <qperret@google.com>
Change-Id: I51fb6724d3a305800aa9d9226406c5d1b2e94331
This reverts commit 0a1b52003b.
This will be replaced by a FROMGIT patch shortly.
Signed-off-by: Quentin Perret <qperret@google.com>
Change-Id: I9776a0847f6ccd15cf6ad5d62f0e81dd38aa0598
This reverts commit a8e6e3763a.
This will be replaced by a FROMGIT patch shortly.
Signed-off-by: Quentin Perret <qperret@google.com>
Change-Id: Ibd16a2427b2de139bd7b2651a563238540949ecc
This reverts commit a39c93198b.
This will be replaced by a FROMGIT patch shortly.
Signed-off-by: Quentin Perret <qperret@google.com>
Change-Id: Idf31ec997871d56bcb21b464e73b18a5cb50d49a
This reverts commit 92140facee.
This will be replaced by a FROMGIT patch shortly.
Signed-off-by: Quentin Perret <qperret@google.com>
Change-Id: I4f6f88e9cee2d150fd5df2d584f8c5cdb4fc03cf
This reverts commit 326a987d71.
This will be replaced by a FROMGIT patch shortly.
Signed-off-by: Quentin Perret <qperret@google.com>
Change-Id: Iab4f3808718db5f245696fdf5c96056280924c53
This reverts commit 206467fd5a.
This will be replaced by a FROMGIT patch shortly.
Signed-off-by: Quentin Perret <qperret@google.com>
Change-Id: I4523460d30837024d2ddc58aab419e634fb299cd
Add CONFIG_MODULE_SIG_PROTECT to enable lookup for the protected
symbols and exports from the build time generated list of symbols
and exports.
Module loading behavior will change as follows:
- Allows Android GKI Modules signed using MODULE_SIG_ALL during build.
- Allows other modules to load if they don't violate the access to
Android GKI protected symbols and do not export the symbols already
exported by the Android GKI modules. Loading will fail and return
-EACCES (Permission denied) if symbol access contidions are not met.
Bug: 200082547
Test: Treehugger
Signed-off-by: Ramji Jiyani <ramjiyani@google.com>
Change-Id: Iedb99d8434db82a9c7f18ffd363d84f4b2316c5b
(cherry picked from commit 9ab6a24225)
Called By: KERNEL_SRC/kernel/Makefile if CONFIG_MODULE_SIG_PROTECT=y
Generates headers required by gki_modules.c from symbol lists:
gki_module_protected.h: from android/abi_gki_modules_protected
gki_module_exported.h: from android/abi_gki_modules_exports
Bug: 200082547
Test: Treehugger
Signed-off-by: Ramji Jiyani <ramjiyani@google.com>
Change-Id: Ibcc6e6fe0ad6c7850d48f7c0a283c7f9b06e4456
(cherry picked from commit 23cd26aab1)
Changes in 5.15.13
Input: i8042 - add deferred probe support
Input: i8042 - enable deferred probe quirk for ASUS UM325UA
tomoyo: Check exceeded quota early in tomoyo_domain_quota_is_ok().
tomoyo: use hwight16() in tomoyo_domain_quota_is_ok()
net/sched: Extend qdisc control block with tc control block
parisc: Clear stale IIR value on instruction access rights trap
platform/mellanox: mlxbf-pmc: Fix an IS_ERR() vs NULL bug in mlxbf_pmc_map_counters
platform/x86: apple-gmux: use resource_size() with res
memblock: fix memblock_phys_alloc() section mismatch error
ALSA: hda: intel-sdw-acpi: harden detection of controller
ALSA: hda: intel-sdw-acpi: go through HDAS ACPI at max depth of 2
recordmcount.pl: fix typo in s390 mcount regex
powerpc/ptdump: Fix DEBUG_WX since generic ptdump conversion
efi: Move efifb_setup_from_dmi() prototype from arch headers
selinux: initialize proto variable in selinux_ip_postroute_compat()
scsi: lpfc: Terminate string in lpfc_debugfs_nvmeio_trc_write()
net/mlx5: DR, Fix NULL vs IS_ERR checking in dr_domain_init_resources
net/mlx5: Fix error print in case of IRQ request failed
net/mlx5: Fix SF health recovery flow
net/mlx5: Fix tc max supported prio for nic mode
net/mlx5e: Wrap the tx reporter dump callback to extract the sq
net/mlx5e: Fix interoperability between XSK and ICOSQ recovery flow
net/mlx5e: Fix ICOSQ recovery flow for XSK
net/mlx5e: Use tc sample stubs instead of ifdefs in source file
net/mlx5e: Delete forward rule for ct or sample action
udp: using datalen to cap ipv6 udp max gso segments
selftests: Calculate udpgso segment count without header adjustment
sctp: use call_rcu to free endpoint
net/smc: fix using of uninitialized completions
net: usb: pegasus: Do not drop long Ethernet frames
net: ag71xx: Fix a potential double free in error handling paths
net: lantiq_xrx200: fix statistics of received bytes
NFC: st21nfca: Fix memory leak in device probe and remove
net/smc: don't send CDC/LLC message if link not ready
net/smc: fix kernel panic caused by race of smc_sock
igc: Do not enable crosstimestamping for i225-V models
igc: Fix TX timestamp support for non-MSI-X platforms
drm/amd/display: Send s0i2_rdy in stream_count == 0 optimization
drm/amd/display: Set optimize_pwr_state for DCN31
ionic: Initialize the 'lif->dbid_inuse' bitmap
net/mlx5e: Fix wrong features assignment in case of error
net: bridge: mcast: add and enforce query interval minimum
net: bridge: mcast: add and enforce startup query interval minimum
selftests/net: udpgso_bench_tx: fix dst ip argument
selftests: net: Fix a typo in udpgro_fwd.sh
net: bridge: mcast: fix br_multicast_ctx_vlan_global_disabled helper
net/ncsi: check for error return from call to nla_put_u32
selftests: net: using ping6 for IPv6 in udpgro_fwd.sh
fsl/fman: Fix missing put_device() call in fman_port_probe
i2c: validate user data in compat ioctl
nfc: uapi: use kernel size_t to fix user-space builds
uapi: fix linux/nfc.h userspace compilation errors
drm/nouveau: wait for the exclusive fence after the shared ones v2
drm/amdgpu: When the VCN(1.0) block is suspended, powergating is explicitly enabled
drm/amdgpu: add support for IP discovery gc_info table v2
drm/amd/display: Changed pipe split policy to allow for multi-display pipe split
xhci: Fresco FL1100 controller should not have BROKEN_MSI quirk set.
usb: gadget: f_fs: Clear ffs_eventfd in ffs_data_clear.
usb: mtu3: add memory barrier before set GPD's HWO
usb: mtu3: fix list_head check warning
usb: mtu3: set interval of FS intr and isoc endpoint
nitro_enclaves: Use get_user_pages_unlocked() call to handle mmap assert
binder: fix async_free_space accounting for empty parcels
scsi: vmw_pvscsi: Set residual data length conditionally
Input: appletouch - initialize work before device registration
Input: spaceball - fix parsing of movement data packets
mm/damon/dbgfs: fix 'struct pid' leaks in 'dbgfs_target_ids_write()'
net: fix use-after-free in tw_timer_handler
fs/mount_setattr: always cleanup mount_kattr
perf intel-pt: Fix parsing of VM time correlation arguments
perf script: Fix CPU filtering of a script's switch events
perf scripts python: intel-pt-events.py: Fix printing of switch events
Linux 5.15.13
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: Ie0d5130643ce211d5e21ea9d11bc5577efeddc90
LLVM_IAS=1 is implied by LLVM=1 as of LLVM_IAS as of v5.15-rc1 via
commit f12b034afe ("scripts/Makefile.clang: default to LLVM_IAS=1").
CROSS_COMPILE is no longer necessary when building with LLVM=1 as of
v5.15-rc1 via:
commit 231ad7f409 ("Makefile: infer --target from ARCH for CC=clang")
There is no need for CROSS_COMPILE_COMPAT as of v5.16-rc1 via
commit 3e6f8d1fa1 ("arm64: vdso32: require CROSS_COMPILE_COMPAT for gcc+bfd")
when LLVM=1 is set. This was backported to v5.15.12.
LINUX_GCC_CROSS_COMPILE_PREBUILTS_BIN was a temporary dependency until
LLVM_IAS=1 was enabled for all architectures. With LLVM_IAS=1 (implied
by LLVM=1), nothing in this directory is used and the variable can be
removed.
It doesn't hurt to respecify any of the above, but they are no longer
necessary.
Bug: 209655537
Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
Change-Id: I361b05ea9f36da933e6712150650f476b093d0a7
commit a78abde220 upstream.
Parser did not take ':' into account.
Example:
Before:
$ perf record -e intel_pt//u uname
Linux
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.026 MB perf.data ]
$ perf inject -i perf.data --vm-time-correlation="dry-run 123"
$ perf inject -i perf.data --vm-time-correlation="dry-run 123:456"
Failed to parse VM Time Correlation options
0x620 [0x98]: failed to process type: 70 [Invalid argument]
$
After:
$ perf inject -i perf.data --vm-time-correlation="dry-run 123:456"
$
Fixes: e3ff42bdeb ("perf intel-pt: Parse VM Time Correlation options and set up decoding")
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Acked-by: Namhyung Kim <namhyung@kernel.org>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Riccardo Mancini <rickyman7@gmail.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20211215080636.149562-2-adrian.hunter@intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 012e332286 upstream.
Make sure that finish_mount_kattr() is called after mount_kattr was
succesfully built in both the success and failure case to prevent
leaking any references we took when we built it. We returned early if
path lookup failed thereby risking to leak an additional reference we
took when building mount_kattr when an idmapped mount was requested.
Cc: linux-fsdevel@vger.kernel.org
Cc: stable@vger.kernel.org
Fixes: 9caccd4154 ("fs: introduce MOUNT_ATTR_IDMAP")
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit e22e45fc9e upstream.
A real world panic issue was found as follow in Linux 5.4.
BUG: unable to handle page fault for address: ffffde49a863de28
PGD 7e6fe62067 P4D 7e6fe62067 PUD 7e6fe63067 PMD f51e064067 PTE 0
RIP: 0010:tw_timer_handler+0x20/0x40
Call Trace:
<IRQ>
call_timer_fn+0x2b/0x120
run_timer_softirq+0x1ef/0x450
__do_softirq+0x10d/0x2b8
irq_exit+0xc7/0xd0
smp_apic_timer_interrupt+0x68/0x120
apic_timer_interrupt+0xf/0x20
This issue was also reported since 2017 in the thread [1],
unfortunately, the issue was still can be reproduced after fixing
DCCP.
The ipv4_mib_exit_net is called before tcp_sk_exit_batch when a net
namespace is destroyed since tcp_sk_ops is registered befrore
ipv4_mib_ops, which means tcp_sk_ops is in the front of ipv4_mib_ops
in the list of pernet_list. There will be a use-after-free on
net->mib.net_statistics in tw_timer_handler after ipv4_mib_exit_net
if there are some inflight time-wait timers.
This bug is not introduced by commit f2bf415cfe ("mib: add net to
NET_ADD_STATS_BH") since the net_statistics is a global variable
instead of dynamic allocation and freeing. Actually, commit
61a7e26028 ("mib: put net statistics on struct net") introduces
the bug since it put net statistics on struct net and free it when
net namespace is destroyed.
Moving init_ipv4_mibs() to the front of tcp_init() to fix this bug
and replace pr_crit() with panic() since continuing is meaningless
when init_ipv4_mibs() fails.
[1] https://groups.google.com/g/syzkaller/c/p1tn-_Kc6l4/m/smuL_FMAAgAJ?pli=1
Fixes: 61a7e26028 ("mib: put net statistics on struct net")
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Cc: Cong Wang <cong.wang@bytedance.com>
Cc: Fam Zheng <fam.zheng@bytedance.com>
Cc: <stable@vger.kernel.org>
Link: https://lore.kernel.org/r/20211228104145.9426-1-songmuchun@bytedance.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit ebb3f994dd upstream.
DAMON debugfs interface increases the reference counts of 'struct pid's
for targets from the 'target_ids' file write callback
('dbgfs_target_ids_write()'), but decreases the counts only in DAMON
monitoring termination callback ('dbgfs_before_terminate()').
Therefore, when 'target_ids' file is repeatedly written without DAMON
monitoring start/termination, the reference count is not decreased and
therefore memory for the 'struct pid' cannot be freed. This commit
fixes this issue by decreasing the reference counts when 'target_ids' is
written.
Link: https://lkml.kernel.org/r/20211229124029.23348-1-sj@kernel.org
Fixes: 4bc05954d0 ("mm/damon: implement a debugfs-based user space interface")
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: <stable@vger.kernel.org> [5.15+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 142c779d05 upstream.
The PVSCSI implementation in the VMware hypervisor under specific
configuration ("SCSI Bus Sharing" set to "Physical") returns zero dataLen
in the completion descriptor for READ CAPACITY(16). As a result, the kernel
can not detect proper disk geometry. This can be recognized by the kernel
message:
[ 0.776588] sd 1:0:0:0: [sdb] Sector size 0 reported, assuming 512.
The PVSCSI implementation in QEMU does not set dataLen at all, keeping it
zeroed. This leads to a boot hang as was reported by Shmulik Ladkani.
It is likely that the controller returns the garbage at the end of the
buffer. Residual length should be set by the driver in that case. The SCSI
layer will erase corresponding data. See commit bdb2b8cab4 ("[SCSI] erase
invalid data returned by device") for details.
Commit e662502b3a ("scsi: vmw_pvscsi: Set correct residual data length")
introduced the issue by setting residual length unconditionally, causing
the SCSI layer to erase the useful payload beyond dataLen when this value
is returned as 0.
As a result, considering existing issues in implementations of PVSCSI
controllers, we do not want to call scsi_set_resid() when dataLen ==
0. Calling scsi_set_resid() has no effect if dataLen equals buffer length.
Link: https://lore.kernel.org/lkml/20210824120028.30d9c071@blondie/
Link: https://lore.kernel.org/r/20211220190514.55935-1-amakhalov@vmware.com
Fixes: e662502b3a ("scsi: vmw_pvscsi: Set correct residual data length")
Cc: Matt Wang <wwentao@vmware.com>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Vishal Bhakta <vbhakta@vmware.com>
Cc: VMware PV-Drivers <pv-drivers@vmware.com>
Cc: James E.J. Bottomley <jejb@linux.ibm.com>
Cc: linux-scsi@vger.kernel.org
Cc: stable@vger.kernel.org
Reported-and-suggested-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
Signed-off-by: Alexey Makhalov <amakhalov@vmware.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit cfd0d84ba2 upstream.
In 4.13, commit 74310e06be ("android: binder: Move buffer out of area shared with user space")
fixed a kernel structure visibility issue. As part of that patch,
sizeof(void *) was used as the buffer size for 0-length data payloads so
the driver could detect abusive clients sending 0-length asynchronous
transactions to a server by enforcing limits on async_free_size.
Unfortunately, on the "free" side, the accounting of async_free_space
did not add the sizeof(void *) back. The result was that up to 8-bytes of
async_free_space were leaked on every async transaction of 8-bytes or
less. These small transactions are uncommon, so this accounting issue
has gone undetected for several years.
The fix is to use "buffer_size" (the allocated buffer size) instead of
"size" (the logical buffer size) when updating the async_free_space
during the free operation. These are the same except for this
corner case of asynchronous transactions with payloads < 8 bytes.
Fixes: 74310e06be ("android: binder: Move buffer out of area shared with user space")
Signed-off-by: Todd Kjos <tkjos@google.com>
Cc: stable@vger.kernel.org # 4.14+
Link: https://lore.kernel.org/r/20211220190150.2107077-1-tkjos@google.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>