And 'inode_no' field to /proc/<pid>/fdinfo/<FD> and
/proc/<pid>/task/<tid>/fdinfo/<FD>.
The inode numbers can be used to uniquely identify DMA buffers
in user space and avoids a dependency on /proc/<pid>/fd/* when
accounting per-process DMA buffer sizes.
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
[Kalesh Singh - Resolve conflict in fd/proc/fd.c]
Bug: 159126739
Bug: 167141117
Link: https://lore.kernel.org/lkml/20210208155315.1367371-2-kaleshsingh@google.com/
Change-Id: Ic9c551998832129051ada07374ed02da3248dc9c
Android captures per-process system memory state when certain low memory
events (e.g a foreground app kill) occur, to identify potential memory
hoggers. In order to measure how much memory a process actually consumes,
it is necessary to include the DMA buffer sizes for that process in the
memory accounting. Since the handle to DMA buffers are raw FDs, it is
important to be able to identify which processes have FD references to
a DMA buffer.
Currently, DMA buffer FDs can be accounted using /proc/<pid>/fd/* and
/proc/<pid>/fdinfo -- both are only readable by the process owner,
as follows:
1. Do a readlink on each FD.
2. If the target path begins with "/dmabuf", then the FD is a dmabuf FD.
3. stat the file to get the dmabuf inode number.
4. Read/ proc/<pid>/fdinfo/<fd>, to get the DMA buffer size.
Accessing other processes' fdinfo requires root privileges. This limits
the use of the interface to debugging environments and is not suitable
for production builds. Granting root privileges even to a system process
increases the attack surface and is highly undesirable.
Since fdinfo doesn't permit reading process memory and manipulating
process state, allow accessing fdinfo under PTRACE_MODE_READ_FSCRED.
Suggested-by: Jann Horn <jannh@google.com>
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Bug: 159126739
Bug: 167141117
Link: https://lore.kernel.org/lkml/20210208155315.1367371-1-kaleshsingh@google.com/
Change-Id: I41407760c7170621420739a044dbc27bdccac339
dl_add_task_root_domain() is called during sched domain rebuild:
rebuild_sched_domains_locked()
partition_and_rebuild_sched_domains()
rebuild_root_domains()
for all top_cpuset descendants:
update_tasks_root_domain()
for all tasks of cpuset:
dl_add_task_root_domain()
Change it so that only the task pi lock is taken to check if the task
has a SCHED_DEADLINE (DL) policy. In case that p is a DL task take the
rq lock as well to be able to safely de-reference root domain's DL
bandwidth structure.
Most of the tasks will have another policy (namely SCHED_NORMAL) and
can now bail without taking the rq lock.
One thing to note here: Even in case that there aren't any DL user
tasks, a slow frequency switching system with cpufreq gov schedutil has
a DL task (sugov) per frequency domain running which participates in DL
bandwidth management.
Bug: 174590586
(cherry picked from commit e6d97c38b59265499e94afb79eb5b2eb6c33edfb
git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git sched/core)
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Quentin Perret <qperret@google.com>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Acked-by: Juri Lelli <juri.lelli@redhat.com>
Link: https://lkml.kernel.org/r/20210119083542.19856-1-dietmar.eggemann@arm.com
Change-Id: Ia28d47671c180e95c4a1b65e7aae99c7c0e9c72b
Changes in 5.10.15
USB: serial: cp210x: add pid/vid for WSDA-200-USB
USB: serial: cp210x: add new VID/PID for supporting Teraoka AD2000
USB: serial: option: Adding support for Cinterion MV31
usb: host: xhci: mvebu: make USB 3.0 PHY optional for Armada 3720
USB: gadget: legacy: fix an error code in eth_bind()
usb: gadget: aspeed: add missing of_node_put
USB: usblp: don't call usb_set_interface if there's a single alt
usb: renesas_usbhs: Clear pipe running flag in usbhs_pkt_pop()
usb: dwc2: Fix endpoint direction check in ep_from_windex
usb: dwc3: fix clock issue during resume in OTG mode
usb: xhci-mtk: fix unreleased bandwidth data
usb: xhci-mtk: skip dropping bandwidth of unchecked endpoints
usb: xhci-mtk: break loop when find the endpoint to drop
ARM: OMAP1: OSK: fix ohci-omap breakage
arm64: dts: qcom: c630: keep both touchpad devices enabled
Input: i8042 - unbreak Pegatron C15B
arm64: dts: amlogic: meson-g12: Set FL-adj property value
arm64: dts: rockchip: fix vopl iommu irq on px30
arm64: dts: rockchip: Use only supported PCIe link speed on Pinebook Pro
ARM: dts: stm32: Fix polarity of the DH DRC02 uSD card detect
ARM: dts: stm32: Connect card-detect signal on DHCOM
ARM: dts: stm32: Disable WP on DHCOM uSD slot
ARM: dts: stm32: Disable optional TSC2004 on DRC02 board
ARM: dts: stm32: Fix GPIO hog flags on DHCOM DRC02
vdpa/mlx5: Fix memory key MTT population
bpf, cgroup: Fix optlen WARN_ON_ONCE toctou
bpf, cgroup: Fix problematic bounds check
bpf, inode_storage: Put file handler if no storage was found
um: virtio: free vu_dev only with the contained struct device
bpf, preload: Fix build when $(O) points to a relative path
arm64: dts: meson: switch TFLASH_VDD_EN pin to open drain on Odroid-C4
r8169: work around RTL8125 UDP hw bug
rxrpc: Fix deadlock around release of dst cached on udp tunnel
arm64: dts: ls1046a: fix dcfg address range
SUNRPC: Fix NFS READs that start at non-page-aligned offsets
igc: set the default return value to -IGC_ERR_NVM in igc_write_nvm_srwr
igc: check return value of ret_val in igc_config_fc_after_link_up
i40e: Revert "i40e: don't report link up for a VF who hasn't enabled queues"
ibmvnic: device remove has higher precedence over reset
net/mlx5: Fix function calculation for page trees
net/mlx5: Fix leak upon failure of rule creation
net/mlx5e: Update max_opened_tc also when channels are closed
net/mlx5e: Release skb in case of failure in tc update skb
net: lapb: Copy the skb before sending a packet
net: mvpp2: TCAM entry enable should be written after SRAM data
r8169: fix WoL on shutdown if CONFIG_DEBUG_SHIRQ is set
net: ipa: pass correct dma_handle to dma_free_coherent()
ARM: dts: sun7i: a20: bananapro: Fix ethernet phy-mode
nvmet-tcp: fix out-of-bounds access when receiving multiple h2cdata PDUs
vdpa/mlx5: Restore the hardware used index after change map
memblock: do not start bottom-up allocations with kernel_end
kbuild: fix duplicated flags in DEBUG_CFLAGS
thunderbolt: Fix possible NULL pointer dereference in tb_acpi_add_link()
ovl: fix dentry leak in ovl_get_redirect
ovl: avoid deadlock on directory ioctl
ovl: implement volatile-specific fsync error behaviour
mac80211: fix station rate table updates on assoc
gpiolib: free device name on error path to fix kmemleak
fgraph: Initialize tracing_graph_pause at task creation
tracing/kprobe: Fix to support kretprobe events on unloaded modules
kretprobe: Avoid re-registration of the same kretprobe earlier
tracing: Use pause-on-trace with the latency tracers
tracepoint: Fix race between tracing and removing tracepoint
libnvdimm/namespace: Fix visibility of namespace resource attribute
libnvdimm/dimm: Avoid race between probe and available_slots_show()
genirq: Prevent [devm_]irq_alloc_desc from returning irq 0
genirq/msi: Activate Multi-MSI early when MSI_FLAG_ACTIVATE_EARLY is set
scripts: use pkg-config to locate libcrypto
xhci: fix bounce buffer usage for non-sg list case
RISC-V: Define MAXPHYSMEM_1GB only for RV32
cifs: report error instead of invalid when revalidating a dentry fails
iommu: Check dev->iommu in dev_iommu_priv_get() before dereferencing it
smb3: Fix out-of-bounds bug in SMB2_negotiate()
smb3: fix crediting for compounding when only one request in flight
mmc: sdhci-pltfm: Fix linking err for sdhci-brcmstb
mmc: core: Limit retries when analyse of SDIO tuples fails
Fix unsynchronized access to sev members through svm_register_enc_region
drm/dp/mst: Export drm_dp_get_vc_payload_bw()
drm/i915: Fix the MST PBN divider calculation
drm/i915/gem: Drop lru bumping on display unpinning
drm/i915/gt: Close race between enable_breadcrumbs and cancel_breadcrumbs
drm/i915/display: Prevent double YUV range correction on HDR planes
drm/i915: Extract intel_ddi_power_up_lanes()
drm/i915: Power up combo PHY lanes for for HDMI as well
drm/amd/display: Revert "Fix EDID parsing after resume from suspend"
io_uring: don't modify identity's files uncess identity is cowed
nvme-pci: avoid the deepest sleep state on Kingston A2000 SSDs
KVM: SVM: Treat SVM as unsupported when running as an SEV guest
KVM: x86/mmu: Fix TDP MMU zap collapsible SPTEs
KVM: x86: Allow guests to see MSR_IA32_TSX_CTRL even if tsx=off
KVM: x86: fix CPUID entries returned by KVM_GET_CPUID2 ioctl
KVM: x86: Update emulator context mode if SYSENTER xfers to 64-bit mode
KVM: x86: Set so called 'reserved CR3 bits in LM mask' at vCPU reset
DTS: ARM: gta04: remove legacy spi-cs-high to make display work again
ARM: dts; gta04: SPI panel chip select is active low
ARM: footbridge: fix dc21285 PCI configuration accessors
ARM: 9043/1: tegra: Fix misplaced tegra_uart_config in decompressor
mm: hugetlbfs: fix cannot migrate the fallocated HugeTLB page
mm: hugetlb: fix a race between freeing and dissolving the page
mm: hugetlb: fix a race between isolating and freeing page
mm: hugetlb: remove VM_BUG_ON_PAGE from page_huge_active
mm, compaction: move high_pfn to the for loop scope
mm/vmalloc: separate put pages and flush VM flags
mm: thp: fix MADV_REMOVE deadlock on shmem THP
mm/filemap: add missing mem_cgroup_uncharge() to __add_to_page_cache_locked()
x86/build: Disable CET instrumentation in the kernel
x86/debug: Fix DR6 handling
x86/debug: Prevent data breakpoints on __per_cpu_offset
x86/debug: Prevent data breakpoints on cpu_dr7
x86/apic: Add extra serialization for non-serializing MSRs
Input: goodix - add support for Goodix GT9286 chip
Input: xpad - sync supported devices with fork on GitHub
Input: ili210x - implement pressure reporting for ILI251x
md: Set prev_flush_start and flush_bio in an atomic way
igc: Report speed and duplex as unknown when device is runtime suspended
neighbour: Prevent a dead entry from updating gc_list
net: ip_tunnel: fix mtu calculation
udp: ipv4: manipulate network header of NATed UDP GRO fraglist
net: dsa: mv88e6xxx: override existent unicast portvec in port_fdb_add
net: sched: replaced invalid qdisc tree flush helper in qdisc_replace
Linux 5.10.15
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I15750357b4c30739515fdc0bbbd0e04b7c986171
commit c3df39ac9b upstream.
UDP/IP header of UDP GROed frag_skbs are not updated even after NAT
forwarding. Only the header of head_skb from ip_finish_output_gso ->
skb_gso_segment is updated but following frag_skbs are not updated.
A call path skb_mac_gso_segment -> inet_gso_segment ->
udp4_ufo_fragment -> __udp_gso_segment -> __udp_gso_segment_list
does not try to update UDP/IP header of the segment list but copy
only the MAC header.
Update port, addr and check of each skb of the segment list in
__udp_gso_segment_list. It covers both SNAT and DNAT.
Fixes: 9fd1ff5d2a (udp: Support UDP fraglist GRO/GSO.)
Signed-off-by: Dongseok Yi <dseok.yi@samsung.com>
Acked-by: Steffen Klassert <steffen.klassert@secunet.com>
Link: https://lore.kernel.org/r/1611962007-80092-1-git-send-email-dseok.yi@samsung.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit eb4e8fac00 upstream.
Following race condition was detected:
<CPU A, t0> - neigh_flush_dev() is under execution and calls
neigh_mark_dead(n) marking the neighbour entry 'n' as dead.
<CPU B, t1> - Executing: __netif_receive_skb() ->
__netif_receive_skb_core() -> arp_rcv() -> arp_process().arp_process()
calls __neigh_lookup() which takes a reference on neighbour entry 'n'.
<CPU A, t2> - Moves further along neigh_flush_dev() and calls
neigh_cleanup_and_release(n), but since reference count increased in t2,
'n' couldn't be destroyed.
<CPU B, t3> - Moves further along, arp_process() and calls
neigh_update()-> __neigh_update() -> neigh_update_gc_list(), which adds
the neighbour entry back in gc_list(neigh_mark_dead(), removed it
earlier in t0 from gc_list)
<CPU B, t4> - arp_process() finally calls neigh_release(n), destroying
the neighbour entry.
This leads to 'n' still being part of gc_list, but the actual
neighbour structure has been freed.
The situation can be prevented from happening if we disallow a dead
entry to have any possibility of updating gc_list. This is what the
patch intends to achieve.
Fixes: 9c29a2f55e ("neighbor: Fix locking order for gc_list changes")
Signed-off-by: Chinmay Agarwal <chinagar@codeaurora.org>
Reviewed-by: Cong Wang <xiyou.wangcong@gmail.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20210127165453.GA20514@chinagar-linux.qualcomm.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit dc5d17a3c3 upstream.
One customer reports a crash problem which causes by flush request. It
triggers a warning before crash.
/* new request after previous flush is completed */
if (ktime_after(req_start, mddev->prev_flush_start)) {
WARN_ON(mddev->flush_bio);
mddev->flush_bio = bio;
bio = NULL;
}
The WARN_ON is triggered. We use spin lock to protect prev_flush_start and
flush_bio in md_flush_request. But there is no lock protection in
md_submit_flush_data. It can set flush_bio to NULL first because of
compiler reordering write instructions.
For example, flush bio1 sets flush bio to NULL first in
md_submit_flush_data. An interrupt or vmware causing an extended stall
happen between updating flush_bio and prev_flush_start. Because flush_bio
is NULL, flush bio2 can get the lock and submit to underlayer disks. Then
flush bio1 updates prev_flush_start after the interrupt or extended stall.
Then flush bio3 enters in md_flush_request. The start time req_start is
behind prev_flush_start. The flush_bio is not NULL(flush bio2 hasn't
finished). So it can trigger the WARN_ON now. Then it calls INIT_WORK
again. INIT_WORK() will re-initialize the list pointers in the
work_struct, which then can result in a corrupted work list and the
work_struct queued a second time. With the work list corrupted, it can
lead in invalid work items being used and cause a crash in
process_one_work.
We need to make sure only one flush bio can be handled at one same time.
So add spin lock in md_submit_flush_data to protect prev_flush_start and
flush_bio in an atomic way.
Reviewed-by: David Jeffery <djeffery@redhat.com>
Signed-off-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Jack Wang <jinpu.wang@cloud.ionos.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 25a068b8e9 upstream.
Jan Kiszka reported that the x2apic_wrmsr_fence() function uses a plain
MFENCE while the Intel SDM (10.12.3 MSR Access in x2APIC Mode) calls for
MFENCE; LFENCE.
Short summary: we have special MSRs that have weaker ordering than all
the rest. Add fencing consistent with current SDM recommendations.
This is not known to cause any issues in practice, only in theory.
Longer story below:
The reason the kernel uses a different semantic is that the SDM changed
(roughly in late 2017). The SDM changed because folks at Intel were
auditing all of the recommended fences in the SDM and realized that the
x2apic fences were insufficient.
Why was the pain MFENCE judged insufficient?
WRMSR itself is normally a serializing instruction. No fences are needed
because the instruction itself serializes everything.
But, there are explicit exceptions for this serializing behavior written
into the WRMSR instruction documentation for two classes of MSRs:
IA32_TSC_DEADLINE and the X2APIC MSRs.
Back to x2apic: WRMSR is *not* serializing in this specific case.
But why is MFENCE insufficient? MFENCE makes writes visible, but
only affects load/store instructions. WRMSR is unfortunately not a
load/store instruction and is unaffected by MFENCE. This means that a
non-serializing WRMSR could be reordered by the CPU to execute before
the writes made visible by the MFENCE have even occurred in the first
place.
This means that an x2apic IPI could theoretically be triggered before
there is any (visible) data to process.
Does this affect anything in practice? I honestly don't know. It seems
quite possible that by the time an interrupt gets to consume the (not
yet) MFENCE'd data, it has become visible, mostly by accident.
To be safe, add the SDM-recommended fences for all x2apic WRMSRs.
This also leaves open the question of the _other_ weakly-ordered WRMSR:
MSR_IA32_TSC_DEADLINE. While it has the same ordering architecture as
the x2APIC MSRs, it seems substantially less likely to be a problem in
practice. While writes to the in-memory Local Vector Table (LVT) might
theoretically be reordered with respect to a weakly-ordered WRMSR like
TSC_DEADLINE, the SDM has this to say:
In x2APIC mode, the WRMSR instruction is used to write to the LVT
entry. The processor ensures the ordering of this write and any
subsequent WRMSR to the deadline; no fencing is required.
But, that might still leave xAPIC exposed. The safest thing to do for
now is to add the extra, recommended LFENCE.
[ bp: Massage commit message, fix typos, drop accidentally added
newline to tools/arch/x86/include/asm/barrier.h. ]
Reported-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/20200305174708.F77040DD@viggo.jf.intel.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 3943abf2db upstream.
local_db_save() is called at the start of exc_debug_kernel(), reads DR7 and
disables breakpoints to prevent recursion.
When running in a guest (X86_FEATURE_HYPERVISOR), local_db_save() reads the
per-cpu variable cpu_dr7 to check whether a breakpoint is active or not
before it accesses DR7.
A data breakpoint on cpu_dr7 therefore results in infinite #DB recursion.
Disallow data breakpoints on cpu_dr7 to prevent that.
Fixes: 84b6a3491567a("x86/entry: Optimize local_db_save() for virt")
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20210204152708.21308-2-jiangshanlai@gmail.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit c4bed4b969 upstream.
When FSGSBASE is enabled, paranoid_entry() fetches the per-CPU GSBASE value
via __per_cpu_offset or pcpu_unit_offsets.
When a data breakpoint is set on __per_cpu_offset[cpu] (read-write
operation), the specific CPU will be stuck in an infinite #DB loop.
RCU will try to send an NMI to the specific CPU, but it is not working
either since NMI also relies on paranoid_entry(). Which means it's
undebuggable.
Fixes: eaad981291ee3("x86/entry/64: Introduce the FIND_PERCPU_BASE macro")
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20210204152708.21308-1-jiangshanlai@gmail.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 20bf2b3787 upstream.
With retpolines disabled, some configurations of GCC, and specifically
the GCC versions 9 and 10 in Ubuntu will add Intel CET instrumentation
to the kernel by default. That breaks certain tracing scenarios by
adding a superfluous ENDBR64 instruction before the fentry call, for
functions which can be called indirectly.
CET instrumentation isn't currently necessary in the kernel, as CET is
only supported in user space. Disable it unconditionally and move it
into the x86's Makefile as CET/CFI... enablement should be a per-arch
decision anyway.
[ bp: Massage and extend commit message. ]
Fixes: 29be86d7f9 ("kbuild: add -fcf-protection=none when using retpoline flags")
Reported-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Tested-by: Nikolay Borisov <nborisov@suse.com>
Cc: <stable@vger.kernel.org>
Cc: Seth Forshee <seth.forshee@canonical.com>
Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
Link: https://lkml.kernel.org/r/20210128215219.6kct3h2eiustncws@treble
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit da74240eb3 upstream.
Commit 3fea5a499d ("mm: memcontrol: convert page cache to a new
mem_cgroup_charge() API") introduced a bug in __add_to_page_cache_locked()
causing the following splat:
page dumped because: VM_BUG_ON_PAGE(page_memcg(page))
pages's memcg:ffff8889a4116000
------------[ cut here ]------------
kernel BUG at mm/memcontrol.c:2924!
invalid opcode: 0000 [#1] SMP KASAN PTI
CPU: 35 PID: 12345 Comm: cat Tainted: G S W I 5.11.0-rc4-debug+ #1
Hardware name: HP HP Z8 G4 Workstation/81C7, BIOS P60 v01.25 12/06/2017
RIP: commit_charge+0xf4/0x130
Call Trace:
mem_cgroup_charge+0x175/0x770
__add_to_page_cache_locked+0x712/0xad0
add_to_page_cache_lru+0xc5/0x1f0
cachefiles_read_or_alloc_pages+0x895/0x2e10 [cachefiles]
__fscache_read_or_alloc_pages+0x6c0/0xa00 [fscache]
__nfs_readpages_from_fscache+0x16d/0x630 [nfs]
nfs_readpages+0x24e/0x540 [nfs]
read_pages+0x5b1/0xc40
page_cache_ra_unbounded+0x460/0x750
generic_file_buffered_read_get_pages+0x290/0x1710
generic_file_buffered_read+0x2a9/0xc30
nfs_file_read+0x13f/0x230 [nfs]
new_sync_read+0x3af/0x610
vfs_read+0x339/0x4b0
ksys_read+0xf1/0x1c0
do_syscall_64+0x33/0x40
entry_SYSCALL_64_after_hwframe+0x44/0xa9
Before that commit, there was a try_charge() and commit_charge() in
__add_to_page_cache_locked(). These two separated charge functions were
replaced by a single mem_cgroup_charge(). However, it forgot to add a
matching mem_cgroup_uncharge() when the xarray insertion failed with the
page released back to the pool.
Fix this by adding a mem_cgroup_uncharge() call when insertion error
happens.
Link: https://lkml.kernel.org/r/20210125042441.20030-1-longman@redhat.com
Fixes: 3fea5a499d ("mm: memcontrol: convert page cache to a new mem_cgroup_charge() API")
Signed-off-by: Waiman Long <longman@redhat.com>
Reviewed-by: Alex Shi <alex.shi@linux.alibaba.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Muchun Song <smuchun@gmail.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 1c2f67308a upstream.
Sergey reported deadlock between kswapd correctly doing its usual
lock_page(page) followed by down_read(page->mapping->i_mmap_rwsem), and
madvise(MADV_REMOVE) on an madvise(MADV_HUGEPAGE) area doing
down_write(page->mapping->i_mmap_rwsem) followed by lock_page(page).
This happened when shmem_fallocate(punch hole)'s unmap_mapping_range()
reaches zap_pmd_range()'s call to __split_huge_pmd(). The same deadlock
could occur when partially truncating a mapped huge tmpfs file, or using
fallocate(FALLOC_FL_PUNCH_HOLE) on it.
__split_huge_pmd()'s page lock was added in 5.8, to make sure that any
concurrent use of reuse_swap_page() (holding page lock) could not catch
the anon THP's mapcounts and swapcounts while they were being split.
Fortunately, reuse_swap_page() is never applied to a shmem or file THP
(not even by khugepaged, which checks PageSwapCache before calling), and
anonymous THPs are never created in shmem or file areas: so that
__split_huge_pmd()'s page lock can only be necessary for anonymous THPs,
on which there is no risk of deadlock with i_mmap_rwsem.
Link: https://lkml.kernel.org/r/alpine.LSU.2.11.2101161409470.2022@eggly.anvils
Fixes: c444eb564f ("mm: thp: make the THP mapcount atomic against __split_huge_pmd_locked()")
Signed-off-by: Hugh Dickins <hughd@google.com>
Reported-by: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 74e21484e4 upstream.
In fast_isolate_freepages, high_pfn will be used if a prefered one (ie
PFN >= low_fn) not found.
But the high_pfn is not reset before searching an free area, so when it
was used as freepage, it may from another free area searched before. As
a result move_freelist_head(freelist, freepage) will have unexpected
behavior (eg corrupt the MOVABLE freelist)
Unable to handle kernel paging request at virtual address dead000000000200
Mem abort info:
ESR = 0x96000044
Exception class = DABT (current EL), IL = 32 bits
SET = 0, FnV = 0
EA = 0, S1PTW = 0
Data abort info:
ISV = 0, ISS = 0x00000044
CM = 0, WnR = 1
[dead000000000200] address between user and kernel address ranges
-000|list_cut_before(inline)
-000|move_freelist_head(inline)
-000|fast_isolate_freepages(inline)
-000|isolate_freepages(inline)
-000|compaction_alloc(?, ?)
-001|unmap_and_move(inline)
-001|migrate_pages([NSD:0xFFFFFF80088CBBD0] from = 0xFFFFFF80088CBD88, [NSD:0xFFFFFF80088CBBC8] get_new_p
-002|__read_once_size(inline)
-002|static_key_count(inline)
-002|static_key_false(inline)
-002|trace_mm_compaction_migratepages(inline)
-002|compact_zone(?, [NSD:0xFFFFFF80088CBCB0] capc = 0x0)
-003|kcompactd_do_work(inline)
-003|kcompactd([X19] p = 0xFFFFFF93227FBC40)
-004|kthread([X20] _create = 0xFFFFFFE1AFB26380)
-005|ret_from_fork(asm)
The issue was reported on an smart phone product with 6GB ram and 3GB
zram as swap device.
This patch fixes the issue by reset high_pfn before searching each free
area, which ensure freepage and freelist match when call
move_freelist_head in fast_isolate_freepages().
Link: http://lkml.kernel.org/r/20190118175136.31341-12-mgorman@techsingularity.net
Link: https://lkml.kernel.org/r/20210112094720.1238444-1-wu-yan@tcl.com
Fixes: 5a811889de ("mm, compaction: use free lists to quickly locate a migration target")
Signed-off-by: Rokudo Yan <wu-yan@tcl.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 0eb2df2b56 upstream.
There is a race between isolate_huge_page() and __free_huge_page().
CPU0: CPU1:
if (PageHuge(page))
put_page(page)
__free_huge_page(page)
spin_lock(&hugetlb_lock)
update_and_free_page(page)
set_compound_page_dtor(page,
NULL_COMPOUND_DTOR)
spin_unlock(&hugetlb_lock)
isolate_huge_page(page)
// trigger BUG_ON
VM_BUG_ON_PAGE(!PageHead(page), page)
spin_lock(&hugetlb_lock)
page_huge_active(page)
// trigger BUG_ON
VM_BUG_ON_PAGE(!PageHuge(page), page)
spin_unlock(&hugetlb_lock)
When we isolate a HugeTLB page on CPU0. Meanwhile, we free it to the
buddy allocator on CPU1. Then, we can trigger a BUG_ON on CPU0, because
it is already freed to the buddy allocator.
Link: https://lkml.kernel.org/r/20210115124942.46403-5-songmuchun@bytedance.com
Fixes: c8721bbbdd ("mm: memory-hotplug: enable memory hotplug to handle hugepage")
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: David Hildenbrand <david@redhat.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 7ffddd499b upstream.
There is a race condition between __free_huge_page()
and dissolve_free_huge_page().
CPU0: CPU1:
// page_count(page) == 1
put_page(page)
__free_huge_page(page)
dissolve_free_huge_page(page)
spin_lock(&hugetlb_lock)
// PageHuge(page) && !page_count(page)
update_and_free_page(page)
// page is freed to the buddy
spin_unlock(&hugetlb_lock)
spin_lock(&hugetlb_lock)
clear_page_huge_active(page)
enqueue_huge_page(page)
// It is wrong, the page is already freed
spin_unlock(&hugetlb_lock)
The race window is between put_page() and dissolve_free_huge_page().
We should make sure that the page is already on the free list when it is
dissolved.
As a result __free_huge_page would corrupt page(s) already in the buddy
allocator.
Link: https://lkml.kernel.org/r/20210115124942.46403-4-songmuchun@bytedance.com
Fixes: c8721bbbdd ("mm: memory-hotplug: enable memory hotplug to handle hugepage")
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 538eea5362 upstream.
The tegra_uart_config of the DEBUG_LL code is now placed right at the
start of the .text section after commit which enabled debug output in the
decompressor. Tegra devices are not booting anymore if DEBUG_LL is enabled
since tegra_uart_config data is executes as a code. Fix the misplaced
tegra_uart_config storage by embedding it into the code.
Cc: stable@vger.kernel.org
Fixes: 2596a72d33 ("ARM: 9009/1: uncompress: Enable debug in head.S")
Reviewed-by: Linus Walleij <linus.walleij@linaro.org>
Signed-off-by: Dmitry Osipenko <digetx@gmail.com>
Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 39d3454c35 upstream.
Building with gcc 4.9.2 reveals a latent bug in the PCI accessors
for Footbridge platforms, which causes a fatal alignment fault
while accessing IO memory. Fix this by making the assembly volatile.
Cc: stable@vger.kernel.org
Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 181739822c upstream.
With the arrival of
commit 2fee958319 ("spi: dt-bindings: clarify CS behavior for spi-cs-high and gpio descriptors")
it was clarified what the proper state for cs-gpios should be, even if the
flag is ignored. The driver code is doing the right thing since
766c6b63aa ("spi: fix client driver breakages when using GPIO descriptors")
The chip-select of the td028ttec1 panel is active-low, so we must omit spi-cs-high;
attribute (already removed by separate patch) and should now use GPIO_ACTIVE_LOW for
the client device description to be fully consistent.
Fixes: 766c6b63aa ("spi: fix client driver breakages when using GPIO descriptors")
CC: stable@vger.kernel.org
Signed-off-by: H. Nikolaus Schaller <hns@goldelico.com>
Signed-off-by: Tony Lindgren <tony@atomide.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 07af7810e0 upstream.
This reverts
commit f1f028ff89 ("DTS: ARM: gta04: introduce legacy spi-cs-high to make display work again")
which had to be intruduced after
commit 6953c57ab1 ("gpio: of: Handle SPI chipselect legacy bindings")
broke the GTA04 display. This contradicted the data sheet but was the only
way to get it as an spi client operational again.
The panel data sheet defines the chip-select to be active low.
Now, with the arrival of
commit 766c6b63aa ("spi: fix client driver breakages when using GPIO descriptors")
the logic of interaction between spi-cs-high and the gpio descriptor flags
has been changed a second time, making the display broken again. So we have
to remove the original fix which in retrospect was a workaround of a bug in
the spi subsystem and not a feature of the panel or bug in the device tree.
With this fix the device tree is back in sync with the data sheet and
spi subsystem code.
Fixes: 766c6b63aa ("spi: fix client driver breakages when using GPIO descriptors")
CC: stable@vger.kernel.org
Signed-off-by: H. Nikolaus Schaller <hns@goldelico.com>
Signed-off-by: Tony Lindgren <tony@atomide.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 943dea8af2 upstream.
Set the emulator context to PROT64 if SYSENTER transitions from 32-bit
userspace (compat mode) to a 64-bit kernel, otherwise the RIP update at
the end of x86_emulate_insn() will incorrectly truncate the new RIP.
Note, this bug is mostly limited to running an Intel virtual CPU model on
an AMD physical CPU, as other combinations of virtual and physical CPUs
do not trigger full emulation. On Intel CPUs, SYSENTER in compatibility
mode is legal, and unconditionally transitions to 64-bit mode. On AMD
CPUs, SYSENTER is illegal in compatibility mode and #UDs. If the vCPU is
AMD, KVM injects a #UD on SYSENTER in compat mode. If the pCPU is Intel,
SYSENTER will execute natively and not trigger #UD->VM-Exit (ignoring
guest TLB shenanigans).
Fixes: fede8076aa ("KVM: x86: handle wrap around 32-bit address space")
Cc: stable@vger.kernel.org
Signed-off-by: Jonny Barker <jonny@jonnybarker.com>
[sean: wrote changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210202165546.2390296-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 181f494888 upstream.
Recent commit 255cbecfe0 modified struct kvm_vcpu_arch to make
'cpuid_entries' a pointer to an array of kvm_cpuid_entry2 entries
rather than embedding the array in the struct. KVM_SET_CPUID and
KVM_SET_CPUID2 were updated accordingly, but KVM_GET_CPUID2 was missed.
As a result, KVM_GET_CPUID2 currently returns random fields from struct
kvm_vcpu_arch to userspace rather than the expected CPUID values. Fix
this by treating 'cpuid_entries' as a pointer when copying its
contents to userspace buffer.
Fixes: 255cbecfe0 ("KVM: x86: allocate vcpu->arch.cpuid_entries dynamically")
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Michael Roth <michael.roth@amd.com.com>
Message-Id: <20210128024451.1816770-1-michael.roth@amd.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 7131636e7e upstream.
Userspace that does not know about KVM_GET_MSR_FEATURE_INDEX_LIST
will generally use the default value for MSR_IA32_ARCH_CAPABILITIES.
When this happens and the host has tsx=on, it is possible to end up with
virtual machines that have HLE and RTM disabled, but TSX_CTRL available.
If the fleet is then switched to tsx=off, kvm_get_arch_capabilities()
will clear the ARCH_CAP_TSX_CTRL_MSR bit and it will not be possible to
use the tsx=off hosts as migration destinations, even though the guests
do not have TSX enabled.
To allow this migration, allow guests to write to their TSX_CTRL MSR,
while keeping the host MSR unchanged for the entire life of the guests.
This ensures that TSX remains disabled and also saves MSR reads and
writes, and it's okay to do because with tsx=off we know that guests will
not have the HLE and RTM features in their CPUID. (If userspace sets
bogus CPUID data, we do not expect HLE and RTM to work in guests anyway).
Cc: stable@vger.kernel.org
Fixes: cbbaa2727a ("KVM: x86: fix presentation of TSX feature in ARCH_CAPABILITIES")
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit ccd85d90ce upstream.
Don't let KVM load when running as an SEV guest, regardless of what
CPUID says. Memory is encrypted with a key that is not accessible to
the host (L0), thus it's impossible for L0 to emulate SVM, e.g. it'll
see garbage when reading the VMCB.
Technically, KVM could decrypt all memory that needs to be accessible to
the L0 and use shadow paging so that L0 does not need to shadow NPT, but
exposing such information to L0 largely defeats the purpose of running as
an SEV guest. This can always be revisited if someone comes up with a
use case for running VMs inside SEV guests.
Note, VMLOAD, VMRUN, etc... will also #GP on GPAs with C-bit set, i.e. KVM
is doomed even if the SEV guest is debuggable and the hypervisor is willing
to decrypt the VMCB. This may or may not be fixed on CPUs that have the
SVME_ADDR_CHK fix.
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210202212017.2486595-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit e4747cb3ec upstream.
If we enable_breadcrumbs for a request while that request is being
removed from HW; we may see that the request is active as we take the
ce->signal_lock and proceed to attach the request to ce->signals.
However, during unsubmission after marking the request as inactive, we
see that the request has not yet been added to ce->signals and so skip
the removal. Pull the check during cancel_breadcrumbs under the same
spinlock as enabling so that we the two tests are consistent in
enable/cancel.
Otherwise, we may insert a request onto ce->signals that we expect should
not be there:
intel_context_remove_breadcrumbs:488 GEM_BUG_ON(!__i915_request_is_complete(rq))
While updating, we can note that we are always called with
irqs-disabled, due to the engine->active.lock being held at the single
caller, and so remove the irqsave/restore making it symmetric to
enable_breadcrumbs.
Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/2931
Fixes: c18636f763 ("drm/i915: Remove requirement for holding i915_request.lock for breadcrumbs")
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Cc: Andi Shyti <andi.shyti@intel.com>
Cc: <stable@vger.kernel.org> # v5.10+
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20210119162057.31097-1-chris@chris-wilson.co.uk
(cherry picked from commit e7004ea4f5)
Signed-off-by: Jani Nikula <jani.nikula@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 882554042d upstream.
Atm the driver will calculate a wrong MST timeslots/MTP (aka time unit)
value for MST streams if the link parameters (link rate or lane count)
are limited in a way independent of the sink capabilities (reported by
DPCD).
One example of such a limitation is when a MUX between the sink and
source connects only a limited number of lanes to the display and
connects the rest of the lanes to other peripherals (USB).
Another issue is that atm MST core calculates the divider based on the
backwards compatible DPCD (at address 0x0000) vs. the extended
capability info (at address 0x2200). This can result in leaving some
part of the MST BW unused (For instance in case of the WD19TB dock).
Fix the above two issues by calculating the PBN divider value based on
the rate and lane count link parameters that the driver uses for all
other computation.
Bugzilla: https://gitlab.freedesktop.org/drm/intel/-/issues/2977
Cc: Lyude Paul <lyude@redhat.com>
Cc: Ville Syrjala <ville.syrjala@intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Imre Deak <imre.deak@intel.com>
Reviewed-by: Ville Syrjala <ville.syrjala@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20210125173636.1733812-2-imre.deak@intel.com
(cherry picked from commit b59c27cab2)
Signed-off-by: Jani Nikula <jani.nikula@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>