mirror of
https://github.com/hardkernel/linux.git
synced 2026-04-01 18:53:02 +09:00
16ce487b86fe055ee2efb84ab53cd12802dc509e
23 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
93c892f7ca |
userfaultfd_release: always remove uffd flags and clear vm_userfaultfd_ctx
commit |
||
|
|
4da79de827 |
coredump: fix race condition between mmget_not_zero()/get_task_mm() and core dumping
commit |
||
|
|
319c8e1bc7 |
Merge 4.9.71 into android-4.9
Changes in 4.9.71 mfd: fsl-imx25: Clean up irq settings during removal crypto: rsa - fix buffer overread when stripping leading zeroes crypto: hmac - require that the underlying hash algorithm is unkeyed crypto: salsa20 - fix blkcipher_walk API usage autofs: fix careless error in recent commit tracing: Allocate mask_str buffer dynamically USB: uas and storage: Add US_FL_BROKEN_FUA for another JMicron JMS567 ID USB: core: prevent malicious bNumInterfaces overflow usbip: fix stub_rx: get_pipe() to validate endpoint number usb: add helper to extract bits 12:11 of wMaxPacketSize usbip: fix stub_rx: harden CMD_SUBMIT path to handle malicious input usbip: fix stub_send_ret_submit() vulnerability to null transfer_buffer ceph: drop negative child dentries before try pruning inode's alias usb: xhci: fix TDS for MTK xHCI1.1 Bluetooth: btusb: driver to enable the usb-wakeup feature xhci: Don't add a virt_dev to the devs array before it's fully allocated nfs: don't wait on commit in nfs_commit_inode() if there were no commit requests sched/rt: Do not pull from current CPU if only one CPU to pull eeprom: at24: change nvmem stride to 1 dmaengine: dmatest: move callback wait queue to thread context ext4: fix fdatasync(2) after fallocate(2) operation ext4: fix crash when a directory's i_size is too small mac80211: Fix addition of mesh configuration element usb: phy: isp1301: Add OF device ID table KVM: nVMX: do not warn when MSR bitmap address is not backed usb: xhci-mtk: check hcc_params after adding primary hcd md-cluster: free md_cluster_info if node leave cluster userfaultfd: shmem: __do_fault requires VM_FAULT_NOPAGE userfaultfd: selftest: vm: allow to build in vm/ directory net: initialize msg.msg_flags in recvfrom bnxt_en: Ignore 0 value in autoneg supported speed from firmware. net: bcmgenet: correct the RBUF_OVFL_CNT and RBUF_ERR_CNT MIB values net: bcmgenet: correct MIB access of UniMAC RUNT counters net: bcmgenet: reserved phy revisions must be checked first net: bcmgenet: power down internal phy if open or resume fails net: bcmgenet: synchronize irq0 status between the isr and task net: bcmgenet: Power up the internal PHY before probing the MII rxrpc: Wake up the transmitter if Rx window size increases on the peer net/mlx5: Fix create autogroup prev initializer net/mlx5: Don't save PCI state when PCI error is detected iommu/io-pgtable-arm-v7s: Check for leaf entry before dereferencing it drm/amdgpu: fix parser init error path to avoid crash in parser fini NFSD: fix nfsd_minorversion(.., NFSD_AVAIL) NFSD: fix nfsd_reset_versions for NFSv4. Input: i8042 - add TUXEDO BU1406 (N24_25BU) to the nomux list drm/omap: fix dmabuf mmap for dma_alloc'ed buffers netfilter: bridge: honor frag_max_size when refragmenting ASoC: rsnd: fix sound route path when using SRC6/SRC9 blk-mq: Fix tagset reinit in the presence of cpu hot-unplug writeback: fix memory leak in wb_queue_work() net: wimax/i2400m: fix NULL-deref at probe dmaengine: Fix array index out of bounds warning in __get_unmap_pool() irqchip/mvebu-odmi: Select GENERIC_MSI_IRQ_DOMAIN net: Resend IGMP memberships upon peer notification. mlxsw: reg: Fix SPVM max record count mlxsw: reg: Fix SPVMLR max record count qed: Align CIDs according to DORQ requirement qed: Fix mapping leak on LL2 rx flow qed: Fix interrupt flags on Rx LL2 drm: amd: remove broken include path intel_th: pci: Add Gemini Lake support openrisc: fix issue handling 8 byte get_user calls ASoC: rcar: clear DE bit only in PDMACHCR when it stops scsi: hpsa: update check for logical volume status scsi: hpsa: limit outstanding rescans scsi: hpsa: do not timeout reset operations fjes: Fix wrong netdevice feature flags drm/radeon/si: add dpm quirk for Oland Drivers: hv: util: move waiting for release to hv_utils_transport itself iwlwifi: mvm: cleanup pending frames in DQA mode sched/deadline: Add missing update_rq_clock() in dl_task_timer() sched/deadline: Make sure the replenishment timer fires in the next period sched/deadline: Throttle a constrained deadline task activated after the deadline sched/deadline: Use deadline instead of period when calculating overflow mmc: mediatek: Fixed bug where clock frequency could be set wrong drm/radeon: reinstate oland workaround for sclk afs: Fix missing put_page() afs: Populate group ID from vnode status afs: Adjust mode bits processing afs: Deal with an empty callback array afs: Flush outstanding writes when an fd is closed afs: Migrate vlocation fields to 64-bit afs: Prevent callback expiry timer overflow afs: Fix the maths in afs_fs_store_data() afs: Invalid op ID should abort with RXGEN_OPCODE afs: Better abort and net error handling afs: Populate and use client modification time afs: Fix page leak in afs_write_begin() afs: Fix afs_kill_pages() afs: Fix abort on signal while waiting for call completion nvme-loop: fix a possible use-after-free when destroying the admin queue nvmet: confirm sq percpu has scheduled and switched to atomic nvmet-rdma: Fix a possible uninitialized variable dereference net/mlx4_core: Avoid delays during VF driver device shutdown net: mpls: Fix nexthop alive tracking on down events rxrpc: Ignore BUSY packets on old calls tty: don't panic on OOM in tty_set_ldisc() tty: fix data race in tty_ldisc_ref_wait() perf symbols: Fix symbols__fixup_end heuristic for corner cases efi/esrt: Cleanup bad memory map log messages NFSv4.1 respect server's max size in CREATE_SESSION btrfs: add missing memset while reading compressed inline extents target: Use system workqueue for ALUA transitions target: fix ALUA transition timeout handling target: fix race during implicit transition work flushes Revert "x86/acpi: Set persistent cpuid <-> nodeid mapping when booting" HID: cp2112: fix broken gpio_direction_input callback sfc: don't warn on successful change of MAC fbdev: controlfb: Add missing modes to fix out of bounds access video: udlfb: Fix read EDID timeout video: fbdev: au1200fb: Release some resources if a memory allocation fails video: fbdev: au1200fb: Return an error code if a memory allocation fails rtc: pcf8563: fix output clock rate ASoC: Intel: Skylake: Fix uuid_module memory leak in failure case dmaengine: ti-dma-crossbar: Correct am335x/am43xx mux value type PCI/PME: Handle invalid data when reading Root Status powerpc/powernv/cpufreq: Fix the frequency read by /proc/cpuinfo PCI: Do not allocate more buses than available in parent iommu/mediatek: Fix driver name netfilter: ipvs: Fix inappropriate output of procfs powerpc/opal: Fix EBUSY bug in acquiring tokens powerpc/ipic: Fix status get and status clear platform/x86: intel_punit_ipc: Fix resource ioremap warning target/iscsi: Fix a race condition in iscsit_add_reject_from_cmd() iscsi-target: fix memory leak in lio_target_tiqn_addtpg() target:fix condition return in core_pr_dump_initiator_port() target/file: Do not return error for UNMAP if length is zero badblocks: fix wrong return value in badblocks_set if badblocks are disabled iommu/amd: Limit the IOVA page range to the specified addresses xfs: truncate pagecache before writeback in xfs_setattr_size() arm-ccn: perf: Prevent module unload while PMU is in use crypto: tcrypt - fix buffer lengths in test_aead_speed() mm: Handle 0 flags in _calc_vm_trans() macro clk: mediatek: add the option for determining PLL source clock clk: imx6: refine hdmi_isfr's parent to make HDMI work on i.MX6 SoCs w/o VPU clk: hi6220: mark clock cs_atb_syspll as critical clk: tegra: Fix cclk_lp divisor register ppp: Destroy the mutex when cleanup ASoC: rsnd: rsnd_ssi_run_mods() needs to care ssi_parent_mod thermal/drivers/step_wise: Fix temperature regulation misbehavior scsi: scsi_debug: write_same: fix error report GFS2: Take inode off order_write list when setting jdata flag bcache: explicitly destroy mutex while exiting bcache: fix wrong cache_misses statistics Ib/hfi1: Return actual operational VLs in port info query arm64: prevent regressions in compressed kernel image size when upgrading to binutils 2.27 btrfs: tests: Fix a memory leak in error handling path in 'run_test()' platform/x86: hp_accel: Add quirk for HP ProBook 440 G4 nvme: use kref_get_unless_zero in nvme_find_get_ns l2tp: cleanup l2tp_tunnel_delete calls xfs: fix log block underflow during recovery cycle verification xfs: fix incorrect extent state in xfs_bmap_add_extent_unwritten_real RDMA/cxgb4: Declare stag as __be32 PCI: Detach driver before procfs & sysfs teardown on device remove scsi: hpsa: cleanup sas_phy structures in sysfs when unloading scsi: hpsa: destroy sas transport properties before scsi_host powerpc/perf/hv-24x7: Fix incorrect comparison in memord soc: mediatek: pwrap: fix compiler errors tty fix oops when rmmod 8250 usb: musb: da8xx: fix babble condition handling pinctrl: adi2: Fix Kconfig build problem raid5: Set R5_Expanded on parity devices as well as data. scsi: scsi_devinfo: Add REPORTLUN2 to EMC SYMMETRIX blacklist entry IB/core: Fix calculation of maximum RoCE MTU vt6655: Fix a possible sleep-in-atomic bug in vt6655_suspend rtl8188eu: Fix a possible sleep-in-atomic bug in rtw_createbss_cmd rtl8188eu: Fix a possible sleep-in-atomic bug in rtw_disassoc_cmd scsi: sd: change manage_start_stop to bool in sysfs interface scsi: sd: change allow_restart to bool in sysfs interface scsi: bfa: integer overflow in debugfs udf: Avoid overflow when session starts at large offset macvlan: Only deliver one copy of the frame to the macvlan interface RDMA/cma: Avoid triggering undefined behavior IB/ipoib: Grab rtnl lock on heavy flush when calling ndo_open/stop icmp: don't fail on fragment reassembly time exceeded ath9k: fix tx99 potential info leak Linux 4.9.71 Signed-off-by: Greg Kroah-Hartman <gregkh@google.com> |
||
|
|
275314e90c |
userfaultfd: shmem: __do_fault requires VM_FAULT_NOPAGE
[ Upstream commit
|
||
|
|
0d21cf1656 |
Merge 4.9.33 into android-4.9
Changes in 4.9.33 PCI/PM: Add needs_resume flag to avoid suspend complete optimization drm/i915: Prevent the system suspend complete optimization partitions/msdos: FreeBSD UFS2 file systems are not recognized netfilter: nf_conntrack_sip: fix wrong memory initialisation ibmvnic: Fix endian errors in error reporting output ibmvnic: Fix endian error when requesting device capabilities net: xilinx_emaclite: fix freezes due to unordered I/O net: xilinx_emaclite: fix receive buffer overflow tcp: tcp_probe: use spin_lock_bh() ipv6: Handle IPv4-mapped src to in6addr_any dst. ipv6: Inhibit IPv4-mapped src address on the wire. tipc: Fix tipc_sk_reinit race conditions gfs2: Use rhashtable walk interface in glock_hash_walk NET: Fix /proc/net/arp for AX.25 ibmvnic: Call napi_disable instead of napi_enable in failure path ibmvnic: Initialize completion variables before starting work NET: mkiss: Fix panic net: hns: Fix the device being used for dma mapping during TX sierra_net: Skip validating irrelevant fields for IDLE LSIs sierra_net: Add support for IPv6 and Dual-Stack Link Sense Indications i2c: piix4: Request the SMBUS semaphore inside the mutex i2c: piix4: Fix request_region size powerpc/powernv: Properly set "host-ipi" on IPIs kernel/ucount.c: mark user_header with kmemleak_ignore() net: thunderx: Fix PHY autoneg for SGMII QLM mode ipv6: addrconf: fix generation of new temporary addresses vfio/spapr_tce: Set window when adding additional groups to container ipv6: Fix IPv6 packet loss in scenarios involving roaming + snooping switches ARM: defconfigs: make NF_CT_PROTO_SCTP and NF_CT_PROTO_UDPLITE built-in PM / runtime: Avoid false-positive warnings from might_sleep_if() jump label: pass kbuild_cflags when checking for asm goto support shmem: fix sleeping from atomic context kasan: respect /proc/sys/kernel/traceoff_on_warning log2: make order_base_2() behave correctly on const input value zero ethtool: do not vzalloc(0) on registers dump net: phy: Fix lack of reference count on PHY driver net: phy: Fix PHY module checks and NULL deref in phy_attach_direct() net: fix ndo_features_check/ndo_fix_features comment ordering fscache: Fix dead object requeue fscache: Clear outstanding writes when disabling a cookie FS-Cache: Initialise stores_lock in netfs cookie ipv6: fix flow labels when the traffic class is non-0 drm/nouveau: prevent userspace from deleting client object drm/nouveau/fence/g84-: protect against concurrent access to semaphore buffers net/mlx4_core: Avoid command timeouts during VF driver device shutdown gianfar: synchronize DMA API usage by free_skb_rx_queue w/ gfar_new_page pinctrl: baytrail: Rectify debounce support (part 2) cec: fix wrong last_la determination drm: prevent double-(un)registration for connectors drm: Don't race connector registration pinctrl: berlin-bg4ct: fix the value for "sd1a" of pin SCRD0_CRD_PRES net: adaptec: starfire: add checks for dma mapping errors drm/i915: Check for NULL i915_vma in intel_unpin_fb_obj() net/mlx5: E-Switch, Err when retrieving steering name-space fails net/mlx5: Return EOPNOTSUPP when failing to get steering name-space parisc, parport_gsc: Fixes for printk continuation lines net: phy: micrel: add support for KSZ8795 gtp: add genl family modules alias drm/nouveau: Intercept ACPI_VIDEO_NOTIFY_PROBE drm/nouveau: Rename acpi_work to hpd_work drm/nouveau: Handle fbcon suspend/resume in seperate worker drm/nouveau: Don't enabling polling twice on runtime resume drm/nouveau: Fix drm poll_helper handling drm/ast: Fixed system hanged if disable P2A ravb: unmap descriptors when freeing rings nfs: Fix "Don't increment lock sequence ID after NFS4ERR_MOVED" nvmet-rdma: Fix missing dma sync to nvme data structures r8152: avoid start_xmit to call napi_schedule during autosuspend r8152: check rx after napi is enabled r8152: re-schedule napi for tx r8152: fix rtl8152_post_reset function r8152: avoid start_xmit to schedule napi when napi is disabled net-next: ethernet: mediatek: change the compatible string bnxt_en: Fix bnxt_reset() in the slow path task. bnxt_en: Enhance autoneg support. bnxt_en: Fix RTNL lock usage on bnxt_update_link(). bnxt_en: Fix RTNL lock usage on bnxt_get_port_module_status(). sctp: sctp gso should set feature with NETIF_F_SG when calling skb_segment sctp: sctp_addr_id2transport should verify the addr before looking up assoc usb: musb: Fix external abort on non-linefetch for musb_irq_work() mn10300: fix build error of missing fpu_save() romfs: use different way to generate fsid for BLOCK or MTD frv: add atomic64_add_unless() frv: add missing atomic64 operations proc: add a schedule point in proc_pid_readdir() userfaultfd: fix SIGBUS resulting from false rwsem wakeups kernel/watchdog.c: move hardlockup detector to separate file kernel/watchdog.c: move shared definitions to nmi.h kernel/watchdog: prevent false hardlockup on overloaded system vhost/vsock: handle vhost_vq_init_access() error ARC: smp-boot: Decouple Non masters waiting API from jump to entry point ARCv2: smp-boot: wake_flag polling by non-Masters needs to be uncached tipc: ignore requests when the connection state is not CONNECTED tipc: fix connection refcount error tipc: add subscription refcount to avoid invalid delete tipc: fix nametbl_lock soft lockup at node/link events netfilter: nf_tables: fix set->nelems counting with no NLM_F_EXCL netfilter: nft_log: restrict the log prefix length to 127 RDMA/qedr: Dispatch port active event from qedr_add RDMA/qedr: Fix and simplify memory leak in PD alloc RDMA/qedr: Don't reset QP when queues aren't flushed RDMA/qedr: Don't spam dmesg if QP is in error state RDMA/qedr: Return max inline data in QP query result xtensa: don't use linux IRQ #0 s390/kvm: do not rely on the ILC on kvm host protection fauls drm/i915: Workaround VLV/CHV DSI scanline counter hardware fail drm/i915: Always recompute watermarks when distrust_bios_wm is set, v2. sparc64: make string buffers large enough Linux 4.9.33 Signed-off-by: Greg Kroah-Hartman <gregkh@google.com> |
||
|
|
dbd9eee1aa |
userfaultfd: fix SIGBUS resulting from false rwsem wakeups
[ Upstream commit
|
||
|
|
3e4578f42f |
ANDROID: mm: add a field to store names for private anonymous memory
Userspace processes often have multiple allocators that each do anonymous mmaps to get memory. When examining memory usage of individual processes or systems as a whole, it is useful to be able to break down the various heaps that were allocated by each layer and examine their size, RSS, and physical memory usage. This patch adds a user pointer to the shared union in vm_area_struct that points to a null terminated string inside the user process containing a name for the vma. vmas that point to the same address will be merged, but vmas that point to equivalent strings at different addresses will not be merged. Userspace can set the name for a region of memory by calling prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME, start, len, (unsigned long)name); Setting the name to NULL clears it. The names of named anonymous vmas are shown in /proc/pid/maps as [anon:<name>] and in /proc/pid/smaps in a new "Name" field that is only present for named vmas. If the userspace pointer is no longer valid all or part of the name will be replaced with "<fault>". The idea to store a userspace pointer to reduce the complexity within mm (at the expense of the complexity of reading /proc/pid/mem) came from Dave Hansen. This results in no runtime overhead in the mm subsystem other than comparing the anon_name pointers when considering vma merging. The pointer is stored in a union with fieds that are only used on file-backed mappings, so it does not increase memory usage. Includes fix from Jed Davis <jld@mozilla.com> for typo in prctl_set_vma_anon_name, which could attempt to set the name across two vmas at the same time due to a typo, which might corrupt the vma list. Fix it to use tmp instead of end to limit the name setting to a single vma at a time. Change-Id: I9aa7b6b5ef536cd780599ba4e2fba8ceebe8b59f Signed-off-by: Dmitry Shmidt <dimitrysh@google.com> |
||
|
|
bae473a423 |
mm: introduce fault_env
The idea borrowed from Peter's patch from patchset on speculative page faults[1]: Instead of passing around the endless list of function arguments, replace the lot with a single structure so we can change context without endless function signature changes. The changes are mostly mechanical with exception of faultaround code: filemap_map_pages() got reworked a bit. This patch is preparation for the next one. [1] http://lkml.kernel.org/r/20141020222841.302891540@infradead.org Link: http://lkml.kernel.org/r/1466021202-61880-9-git-send-email-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
d2005e3f41 |
userfaultfd: don't pin the user memory in userfaultfd_file_create()
userfaultfd_file_create() increments mm->mm_users; this means that the memory won't be unmapped/freed if mm owner exits/execs, and UFFDIO_COPY after that can populate the orphaned mm more. Change userfaultfd_file_create() and userfaultfd_ctx_put() to use mm->mm_count to pin mm_struct. This means that atomic_inc_not_zero(mm->mm_users) is needed when we are going to actually play with this memory. Except handle_userfault() path doesn't need this, the caller must already have a reference. The patch adds the new trivial helper, mmget_not_zero(), it can have more users. Link: http://lkml.kernel.org/r/20160516172254.GA8595@redhat.com Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
39680f50ae |
userfaultfd: don't block on the last VM updates at exit time
The exit path will do some final updates to the VM of an exiting process to inform others of the fact that the process is going away. That happens, for example, for robust futex state cleanup, but also if the parent has asked for a TID update when the process exits (we clear the child tid field in user space). However, at the time we do those final VM accesses, we've already stopped accepting signals, so the usual "stop waiting for userfaults on signal" code in fs/userfaultfd.c no longer works, and the process can become an unkillable zombie waiting for something that will never happen. To solve this, just make handle_userfault() abort any user fault handling if we're already in the exit path past the signal handling state being dead (marked by PF_EXITING). This VM special case is pretty ugly, and it is possible that we should look at finalizing signals later (or move the VM final accesses earlier). But in the meantime this is a fairly minimally intrusive fix. Reported-and-tested-by: Dmitry Vyukov <dvyukov@google.com> Acked-by: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
ac5be6b47e |
userfaultfd: revert "userfaultfd: waitqueue: add nr wake parameter to __wake_up_locked_key"
This reverts commit
|
||
|
|
c03e946fdd |
userfaultfd: add missing mmput() in error path
This fixes a memleak if anon_inode_getfile() fails in userfaultfd(). Signed-off-by: Eric Biggers <ebiggers3@gmail.com> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
2c5b7e1be7 |
userfaultfd: avoid missing wakeups during refile in userfaultfd_read
During the refile in userfaultfd_read both waitqueues could look empty to the lockless wake_userfault(). Use a seqcount to prevent this false negative that could leave an userfault blocked. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
dfa37dc3fc |
userfaultfd: allow signals to interrupt a userfault
This is only simple to achieve if the userfault is going to return to userland (not to the kernel) because we can avoid returning VM_FAULT_RETRY despite we temporarily released the mmap_sem. The fault would just be retried by userland then. This is safe at least on x86 and powerpc (the two archs with the syscall implemented so far). Hint to verify for which archs this is safe: after handle_mm_fault returns, no access to data structures protected by the mmap_sem must be done by the fault code in arch/*/mm/fault.c until up_read(&mm->mmap_sem) is called. This has two main benefits: signals can run with lower latency in production (signals aren't blocked by userfaults and userfaults are immediately repeated after signal processing) and gdb can then trivially debug the threads blocked in this kind of userfaults coming directly from userland. On a side note: while gdb has a need to get signal processed, coredumps always worked perfectly with userfaults, no matter if the userfault is triggered by GUP a kernel copy_user or directly from userland. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
e6485a47b7 |
userfaultfd: require UFFDIO_API before other ioctls
UFFDIO_API was already forced before read/poll could work. This makes the code more strict to force it also for all other ioctls. All users would already have been required to call UFFDIO_API before invoking other ioctls but this makes it more explicit. This will ensure we can change all ioctls (all but UFFDIO_API/struct uffdio_api) with a bump of uffdio_api.api. There's no actual plan or need to change the API or the ioctl, the current API already should cover fine even the non cooperative usage, but this is just for the longer term future just in case. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
ad465cae96 |
userfaultfd: UFFDIO_COPY and UFFDIO_ZEROPAGE
These two ioctl allows to either atomically copy or to map zeropages into the virtual address space. This is used by the thread that opened the userfaultfd to resolve the userfaults. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Pavel Emelyanov <xemul@parallels.com> Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com> Cc: zhang.zhanghailiang@huawei.com Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Andres Lagar-Cavilla <andreslc@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Hugh Dickins <hughd@google.com> Cc: Peter Feiner <pfeiner@google.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
8d2afd96c2 |
userfaultfd: solve the race between UFFDIO_COPY|ZEROPAGE and read
Solve in-kernel the race between UFFDIO_COPY|ZEROPAGE and
userfaultfd_read if they are run on different threads simultaneously.
Until now qemu solved the race in userland: the race was explicitly
and intentionally left for userland to solve. However we can also
solve it in kernel.
Requiring all users to solve this race if they use two threads (one
for the background transfer and one for the userfault reads) isn't
very attractive from an API prospective, furthermore this allows to
remove a whole bunch of mutex and bitmap code from qemu, making it
faster. The cost of __get_user_pages_fast should be insignificant
considering it scales perfectly and the pagetables are already hot in
the CPU cache, compared to the overhead in userland to maintain those
structures.
Applying this patch is backwards compatible with respect to the
userfaultfd userland API, however reverting this change wouldn't be
backwards compatible anymore.
Without this patch qemu in the background transfer thread, has to read
the old state, and do UFFDIO_WAKE if old_state is missing but it
become REQUESTED by the time it tries to set it to RECEIVED (signaling
the other side received an userfault).
vcpu background_thr userfault_thr
----- ----- -----
vcpu0 handle_mm_fault()
postcopy_place_page
read old_state -> MISSING
UFFDIO_COPY 0x7fb76a139000 (no wakeup, still pending)
vcpu0 fault at 0x7fb76a139000 enters handle_userfault
poll() is kicked
poll() -> POLLIN
read() -> 0x7fb76a139000
postcopy_pmi_change_state(MISSING, REQUESTED) -> REQUESTED
tmp_state = postcopy_pmi_change_state(old_state, RECEIVED) -> REQUESTED
/* check that no userfault raced with UFFDIO_COPY */
if (old_state == MISSING && tmp_state == REQUESTED)
UFFDIO_WAKE from background thread
And a second case where a UFFDIO_WAKE would be needed is in the userfault thread:
vcpu background_thr userfault_thr
----- ----- -----
vcpu0 handle_mm_fault()
postcopy_place_page
read old_state -> MISSING
UFFDIO_COPY 0x7fb76a139000 (no wakeup, still pending)
tmp_state = postcopy_pmi_change_state(old_state, RECEIVED) -> RECEIVED
vcpu0 fault at 0x7fb76a139000 enters handle_userfault
poll() is kicked
poll() -> POLLIN
read() -> 0x7fb76a139000
if (postcopy_pmi_change_state(MISSING, REQUESTED) == RECEIVED)
UFFDIO_WAKE from userfault thread
This patch removes the need of both UFFDIO_WAKE and of the associated
per-page tristate as well.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com>
Cc: zhang.zhanghailiang@huawei.com
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Andres Lagar-Cavilla <andreslc@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Hugh Dickins <hughd@google.com>
Cc: Peter Feiner <pfeiner@google.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
||
|
|
3004ec9cab |
userfaultfd: allocate the userfaultfd_ctx cacheline aligned
Use proper slab to guarantee alignment. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Pavel Emelyanov <xemul@parallels.com> Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com> Cc: zhang.zhanghailiang@huawei.com Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Andres Lagar-Cavilla <andreslc@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Hugh Dickins <hughd@google.com> Cc: Peter Feiner <pfeiner@google.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
15b726ef04 |
userfaultfd: optimize read() and poll() to be O(1)
This makes read O(1) and poll that was already O(1) becomes lockless. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Pavel Emelyanov <xemul@parallels.com> Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com> Cc: zhang.zhanghailiang@huawei.com Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Andres Lagar-Cavilla <andreslc@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Hugh Dickins <hughd@google.com> Cc: Peter Feiner <pfeiner@google.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
ba85c702e4 |
userfaultfd: wake pending userfaults
This is an optimization but it's a userland visible one and it affects the API. The downside of this optimization is that if you call poll() and you get POLLIN, read(ufd) may still return -EAGAIN. The blocked userfault may be waken by a different thread, before read(ufd) comes around. This in short means that poll() isn't really usable if the userfaultfd is opened in blocking mode. userfaults won't wait in "pending" state to be read anymore and any UFFDIO_WAKE or similar operations that has the objective of waking userfaults after their resolution, will wake all blocked userfaults for the resolved range, including those that haven't been read() by userland yet. The behavior of poll() becomes not standard, but this obviates the need of "spurious" UFFDIO_WAKE and it lets the userland threads to restart immediately without requiring an UFFDIO_WAKE. This is even more significant in case of repeated faults on the same address from multiple threads. This optimization is justified by the measurement that the number of spurious UFFDIO_WAKE accounts for 5% and 10% of the total userfaults for heavy workloads, so it's worth optimizing those away. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Pavel Emelyanov <xemul@parallels.com> Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com> Cc: zhang.zhanghailiang@huawei.com Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Andres Lagar-Cavilla <andreslc@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Hugh Dickins <hughd@google.com> Cc: Peter Feiner <pfeiner@google.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
a9b85f9415 |
userfaultfd: change the read API to return a uffd_msg
I had requests to return the full address (not the page aligned one) to userland. It's not entirely clear how the page offset could be relevant because userfaults aren't like SIGBUS that can sigjump to a different place and it actually skip resolving the fault depending on a page offset. There's currently no real way to skip the fault especially because after a UFFDIO_COPY|ZEROPAGE, the fault is optimized to be retried within the kernel without having to return to userland first (not even self modifying code replacing the .text that touched the faulting address would prevent the fault to be repeated). Userland cannot skip repeating the fault even more so if the fault was triggered by a KVM secondary page fault or any get_user_pages or any copy-user inside some syscall which will return to kernel code. The second time FAULT_FLAG_RETRY_NOWAIT won't be set leading to a SIGBUS being raised because the userfault can't wait if it cannot release the mmap_map first (and FAULT_FLAG_RETRY_NOWAIT is required for that). Still returning userland a proper structure during the read() on the uffd, can allow to use the current UFFD_API for the future non-cooperative extensions too and it looks cleaner as well. Once we get additional fields there's no point to return the fault address page aligned anymore to reuse the bits below PAGE_SHIFT. The only downside is that the read() syscall will read 32bytes instead of 8bytes but that's not going to be measurable overhead. The total number of new events that can be extended or of new future bits for already shipped events, is limited to 64 by the features field of the uffdio_api structure. If more will be needed a bump of UFFD_API will be required. [akpm@linux-foundation.org: use __packed] Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Pavel Emelyanov <xemul@parallels.com> Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com> Cc: zhang.zhanghailiang@huawei.com Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Andres Lagar-Cavilla <andreslc@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Hugh Dickins <hughd@google.com> Cc: Peter Feiner <pfeiner@google.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
3f602d2724 |
userfaultfd: Rename uffd_api.bits into .features
This is (seems to be) the minimal thing that is required to unblock standard uffd usage from the non-cooperative one. Now more bits can be added to the features field indicating e.g. UFFD_FEATURE_FORK and others needed for the latter use-case. Signed-off-by: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com> Cc: zhang.zhanghailiang@huawei.com Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Andres Lagar-Cavilla <andreslc@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Hugh Dickins <hughd@google.com> Cc: Peter Feiner <pfeiner@google.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
86039bd3b4 |
userfaultfd: add new syscall to provide memory externalization
Once an userfaultfd has been created and certain region of the process virtual address space have been registered into it, the thread responsible for doing the memory externalization can manage the page faults in userland by talking to the kernel using the userfaultfd protocol. poll() can be used to know when there are new pending userfaults to be read (POLLIN). Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Pavel Emelyanov <xemul@parallels.com> Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com> Cc: zhang.zhanghailiang@huawei.com Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Andres Lagar-Cavilla <andreslc@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Hugh Dickins <hughd@google.com> Cc: Peter Feiner <pfeiner@google.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |