linux

mirror of https://github.com/hardkernel/linux.git synced 2026-04-01 18:53:02 +09:00

Author	SHA1	Message	Date
Oleg Nesterov	93c892f7ca	userfaultfd_release: always remove uffd flags and clear vm_userfaultfd_ctx commit `46d0b24c5e` upstream. userfaultfd_release() should clear vm_flags/vm_userfaultfd_ctx even if mm->core_state != NULL. Otherwise a page fault can see userfaultfd_missing() == T and use an already freed userfaultfd_ctx. Link: http://lkml.kernel.org/r/20190820160237.GB4983@redhat.com Fixes: `04f5866e41` ("coredump: fix race condition between mmget_not_zero()/get_task_mm() and core dumping") Signed-off-by: Oleg Nesterov <oleg@redhat.com> Reported-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: Andrea Arcangeli <aarcange@redhat.com> Tested-by: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Peter Xu <peterx@redhat.com> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Jann Horn <jannh@google.com> Cc: Jason Gunthorpe <jgg@mellanox.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2023-05-15 14:21:05 +09:00
Andrea Arcangeli	4da79de827	coredump: fix race condition between mmget_not_zero()/get_task_mm() and core dumping commit `04f5866e41` upstream. The core dumping code has always run without holding the mmap_sem for writing, despite that is the only way to ensure that the entire vma layout will not change from under it. Only using some signal serialization on the processes belonging to the mm is not nearly enough. This was pointed out earlier. For example in Hugh's post from Jul 2017: https://lkml.kernel.org/r/alpine.LSU.2.11.1707191716030.2055@eggly.anvils "Not strictly relevant here, but a related note: I was very surprised to discover, only quite recently, how handle_mm_fault() may be called without down_read(mmap_sem) - when core dumping. That seems a misguided optimization to me, which would also be nice to correct" In particular because the growsdown and growsup can move the vm_start/vm_end the various loops the core dump does around the vma will not be consistent if page faults can happen concurrently. Pretty much all users calling mmget_not_zero()/get_task_mm() and then taking the mmap_sem had the potential to introduce unexpected side effects in the core dumping code. Adding mmap_sem for writing around the ->core_dump invocation is a viable long term fix, but it requires removing all copy user and page faults and to replace them with get_dump_page() for all binary formats which is not suitable as a short term fix. For the time being this solution manually covers the places that can confuse the core dump either by altering the vma layout or the vma flags while it runs. Once ->core_dump runs under mmap_sem for writing the function mmget_still_valid() can be dropped. Allowing mmap_sem protected sections to run in parallel with the coredump provides some minor parallelism advantage to the swapoff code (which seems to be safe enough by never mangling any vma field and can keep doing swapins in parallel to the core dumping) and to some other corner case. In order to facilitate the backporting I added "Fixes: 86039bd3b4e6" however the side effect of this same race condition in /proc/pid/mem should be reproducible since before 2.6.12-rc2 so I couldn't add any other "Fixes:" because there's no hash beyond the git genesis commit. Because find_extend_vma() is the only location outside of the process context that could modify the "mm" structures under mmap_sem for reading, by adding the mmget_still_valid() check to it, all other cases that take the mmap_sem for reading don't need the new check after mmget_not_zero()/get_task_mm(). The expand_stack() in page fault context also doesn't need the new check, because all tasks under core dumping are frozen. Link: http://lkml.kernel.org/r/20190325224949.11068-1-aarcange@redhat.com Fixes: `86039bd3b4` ("userfaultfd: add new syscall to provide memory externalization") Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Reported-by: Jann Horn <jannh@google.com> Suggested-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Peter Xu <peterx@redhat.com> Reviewed-by: Mike Rapoport <rppt@linux.ibm.com> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Reviewed-by: Jann Horn <jannh@google.com> Acked-by: Jason Gunthorpe <jgg@mellanox.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> [akaher@vmware.com: stable 4.9 backport - handle binder_update_page_range - mhocko@suse.com] Signed-off-by: Ajay Kaher <akaher@vmware.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2023-05-15 14:09:41 +09:00
Greg Kroah-Hartman	319c8e1bc7	Merge 4.9.71 into android-4.9 Changes in 4.9.71 mfd: fsl-imx25: Clean up irq settings during removal crypto: rsa - fix buffer overread when stripping leading zeroes crypto: hmac - require that the underlying hash algorithm is unkeyed crypto: salsa20 - fix blkcipher_walk API usage autofs: fix careless error in recent commit tracing: Allocate mask_str buffer dynamically USB: uas and storage: Add US_FL_BROKEN_FUA for another JMicron JMS567 ID USB: core: prevent malicious bNumInterfaces overflow usbip: fix stub_rx: get_pipe() to validate endpoint number usb: add helper to extract bits 12:11 of wMaxPacketSize usbip: fix stub_rx: harden CMD_SUBMIT path to handle malicious input usbip: fix stub_send_ret_submit() vulnerability to null transfer_buffer ceph: drop negative child dentries before try pruning inode's alias usb: xhci: fix TDS for MTK xHCI1.1 Bluetooth: btusb: driver to enable the usb-wakeup feature xhci: Don't add a virt_dev to the devs array before it's fully allocated nfs: don't wait on commit in nfs_commit_inode() if there were no commit requests sched/rt: Do not pull from current CPU if only one CPU to pull eeprom: at24: change nvmem stride to 1 dmaengine: dmatest: move callback wait queue to thread context ext4: fix fdatasync(2) after fallocate(2) operation ext4: fix crash when a directory's i_size is too small mac80211: Fix addition of mesh configuration element usb: phy: isp1301: Add OF device ID table KVM: nVMX: do not warn when MSR bitmap address is not backed usb: xhci-mtk: check hcc_params after adding primary hcd md-cluster: free md_cluster_info if node leave cluster userfaultfd: shmem: __do_fault requires VM_FAULT_NOPAGE userfaultfd: selftest: vm: allow to build in vm/ directory net: initialize msg.msg_flags in recvfrom bnxt_en: Ignore 0 value in autoneg supported speed from firmware. net: bcmgenet: correct the RBUF_OVFL_CNT and RBUF_ERR_CNT MIB values net: bcmgenet: correct MIB access of UniMAC RUNT counters net: bcmgenet: reserved phy revisions must be checked first net: bcmgenet: power down internal phy if open or resume fails net: bcmgenet: synchronize irq0 status between the isr and task net: bcmgenet: Power up the internal PHY before probing the MII rxrpc: Wake up the transmitter if Rx window size increases on the peer net/mlx5: Fix create autogroup prev initializer net/mlx5: Don't save PCI state when PCI error is detected iommu/io-pgtable-arm-v7s: Check for leaf entry before dereferencing it drm/amdgpu: fix parser init error path to avoid crash in parser fini NFSD: fix nfsd_minorversion(.., NFSD_AVAIL) NFSD: fix nfsd_reset_versions for NFSv4. Input: i8042 - add TUXEDO BU1406 (N24_25BU) to the nomux list drm/omap: fix dmabuf mmap for dma_alloc'ed buffers netfilter: bridge: honor frag_max_size when refragmenting ASoC: rsnd: fix sound route path when using SRC6/SRC9 blk-mq: Fix tagset reinit in the presence of cpu hot-unplug writeback: fix memory leak in wb_queue_work() net: wimax/i2400m: fix NULL-deref at probe dmaengine: Fix array index out of bounds warning in __get_unmap_pool() irqchip/mvebu-odmi: Select GENERIC_MSI_IRQ_DOMAIN net: Resend IGMP memberships upon peer notification. mlxsw: reg: Fix SPVM max record count mlxsw: reg: Fix SPVMLR max record count qed: Align CIDs according to DORQ requirement qed: Fix mapping leak on LL2 rx flow qed: Fix interrupt flags on Rx LL2 drm: amd: remove broken include path intel_th: pci: Add Gemini Lake support openrisc: fix issue handling 8 byte get_user calls ASoC: rcar: clear DE bit only in PDMACHCR when it stops scsi: hpsa: update check for logical volume status scsi: hpsa: limit outstanding rescans scsi: hpsa: do not timeout reset operations fjes: Fix wrong netdevice feature flags drm/radeon/si: add dpm quirk for Oland Drivers: hv: util: move waiting for release to hv_utils_transport itself iwlwifi: mvm: cleanup pending frames in DQA mode sched/deadline: Add missing update_rq_clock() in dl_task_timer() sched/deadline: Make sure the replenishment timer fires in the next period sched/deadline: Throttle a constrained deadline task activated after the deadline sched/deadline: Use deadline instead of period when calculating overflow mmc: mediatek: Fixed bug where clock frequency could be set wrong drm/radeon: reinstate oland workaround for sclk afs: Fix missing put_page() afs: Populate group ID from vnode status afs: Adjust mode bits processing afs: Deal with an empty callback array afs: Flush outstanding writes when an fd is closed afs: Migrate vlocation fields to 64-bit afs: Prevent callback expiry timer overflow afs: Fix the maths in afs_fs_store_data() afs: Invalid op ID should abort with RXGEN_OPCODE afs: Better abort and net error handling afs: Populate and use client modification time afs: Fix page leak in afs_write_begin() afs: Fix afs_kill_pages() afs: Fix abort on signal while waiting for call completion nvme-loop: fix a possible use-after-free when destroying the admin queue nvmet: confirm sq percpu has scheduled and switched to atomic nvmet-rdma: Fix a possible uninitialized variable dereference net/mlx4_core: Avoid delays during VF driver device shutdown net: mpls: Fix nexthop alive tracking on down events rxrpc: Ignore BUSY packets on old calls tty: don't panic on OOM in tty_set_ldisc() tty: fix data race in tty_ldisc_ref_wait() perf symbols: Fix symbols__fixup_end heuristic for corner cases efi/esrt: Cleanup bad memory map log messages NFSv4.1 respect server's max size in CREATE_SESSION btrfs: add missing memset while reading compressed inline extents target: Use system workqueue for ALUA transitions target: fix ALUA transition timeout handling target: fix race during implicit transition work flushes Revert "x86/acpi: Set persistent cpuid <-> nodeid mapping when booting" HID: cp2112: fix broken gpio_direction_input callback sfc: don't warn on successful change of MAC fbdev: controlfb: Add missing modes to fix out of bounds access video: udlfb: Fix read EDID timeout video: fbdev: au1200fb: Release some resources if a memory allocation fails video: fbdev: au1200fb: Return an error code if a memory allocation fails rtc: pcf8563: fix output clock rate ASoC: Intel: Skylake: Fix uuid_module memory leak in failure case dmaengine: ti-dma-crossbar: Correct am335x/am43xx mux value type PCI/PME: Handle invalid data when reading Root Status powerpc/powernv/cpufreq: Fix the frequency read by /proc/cpuinfo PCI: Do not allocate more buses than available in parent iommu/mediatek: Fix driver name netfilter: ipvs: Fix inappropriate output of procfs powerpc/opal: Fix EBUSY bug in acquiring tokens powerpc/ipic: Fix status get and status clear platform/x86: intel_punit_ipc: Fix resource ioremap warning target/iscsi: Fix a race condition in iscsit_add_reject_from_cmd() iscsi-target: fix memory leak in lio_target_tiqn_addtpg() target:fix condition return in core_pr_dump_initiator_port() target/file: Do not return error for UNMAP if length is zero badblocks: fix wrong return value in badblocks_set if badblocks are disabled iommu/amd: Limit the IOVA page range to the specified addresses xfs: truncate pagecache before writeback in xfs_setattr_size() arm-ccn: perf: Prevent module unload while PMU is in use crypto: tcrypt - fix buffer lengths in test_aead_speed() mm: Handle 0 flags in _calc_vm_trans() macro clk: mediatek: add the option for determining PLL source clock clk: imx6: refine hdmi_isfr's parent to make HDMI work on i.MX6 SoCs w/o VPU clk: hi6220: mark clock cs_atb_syspll as critical clk: tegra: Fix cclk_lp divisor register ppp: Destroy the mutex when cleanup ASoC: rsnd: rsnd_ssi_run_mods() needs to care ssi_parent_mod thermal/drivers/step_wise: Fix temperature regulation misbehavior scsi: scsi_debug: write_same: fix error report GFS2: Take inode off order_write list when setting jdata flag bcache: explicitly destroy mutex while exiting bcache: fix wrong cache_misses statistics Ib/hfi1: Return actual operational VLs in port info query arm64: prevent regressions in compressed kernel image size when upgrading to binutils 2.27 btrfs: tests: Fix a memory leak in error handling path in 'run_test()' platform/x86: hp_accel: Add quirk for HP ProBook 440 G4 nvme: use kref_get_unless_zero in nvme_find_get_ns l2tp: cleanup l2tp_tunnel_delete calls xfs: fix log block underflow during recovery cycle verification xfs: fix incorrect extent state in xfs_bmap_add_extent_unwritten_real RDMA/cxgb4: Declare stag as __be32 PCI: Detach driver before procfs & sysfs teardown on device remove scsi: hpsa: cleanup sas_phy structures in sysfs when unloading scsi: hpsa: destroy sas transport properties before scsi_host powerpc/perf/hv-24x7: Fix incorrect comparison in memord soc: mediatek: pwrap: fix compiler errors tty fix oops when rmmod 8250 usb: musb: da8xx: fix babble condition handling pinctrl: adi2: Fix Kconfig build problem raid5: Set R5_Expanded on parity devices as well as data. scsi: scsi_devinfo: Add REPORTLUN2 to EMC SYMMETRIX blacklist entry IB/core: Fix calculation of maximum RoCE MTU vt6655: Fix a possible sleep-in-atomic bug in vt6655_suspend rtl8188eu: Fix a possible sleep-in-atomic bug in rtw_createbss_cmd rtl8188eu: Fix a possible sleep-in-atomic bug in rtw_disassoc_cmd scsi: sd: change manage_start_stop to bool in sysfs interface scsi: sd: change allow_restart to bool in sysfs interface scsi: bfa: integer overflow in debugfs udf: Avoid overflow when session starts at large offset macvlan: Only deliver one copy of the frame to the macvlan interface RDMA/cma: Avoid triggering undefined behavior IB/ipoib: Grab rtnl lock on heavy flush when calling ndo_open/stop icmp: don't fail on fragment reassembly time exceeded ath9k: fix tx99 potential info leak Linux 4.9.71 Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>	2017-12-20 10:51:15 +01:00
Andrea Arcangeli	275314e90c	userfaultfd: shmem: __do_fault requires VM_FAULT_NOPAGE [ Upstream commit `6bbc4a4144` ] __do_fault assumes vmf->page has been initialized and is valid if VM_FAULT_NOPAGE is not returned by vma->vm_ops->fault(vma, vmf). handle_userfault() in turn should return VM_FAULT_NOPAGE if it doesn't return VM_FAULT_SIGBUS or VM_FAULT_RETRY (the other two possibilities). This VM_FAULT_NOPAGE case is only invoked when signal are pending and it didn't matter for anonymous memory before. It only started to matter since shmem was introduced. hugetlbfs also takes a different path and doesn't exercise __do_fault. Link: http://lkml.kernel.org/r/20170228154201.GH5816@redhat.com Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Reported-by: Dmitry Vyukov <dvyukov@google.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2017-12-20 10:07:18 +01:00
Greg Kroah-Hartman	0d21cf1656	Merge 4.9.33 into android-4.9 Changes in 4.9.33 PCI/PM: Add needs_resume flag to avoid suspend complete optimization drm/i915: Prevent the system suspend complete optimization partitions/msdos: FreeBSD UFS2 file systems are not recognized netfilter: nf_conntrack_sip: fix wrong memory initialisation ibmvnic: Fix endian errors in error reporting output ibmvnic: Fix endian error when requesting device capabilities net: xilinx_emaclite: fix freezes due to unordered I/O net: xilinx_emaclite: fix receive buffer overflow tcp: tcp_probe: use spin_lock_bh() ipv6: Handle IPv4-mapped src to in6addr_any dst. ipv6: Inhibit IPv4-mapped src address on the wire. tipc: Fix tipc_sk_reinit race conditions gfs2: Use rhashtable walk interface in glock_hash_walk NET: Fix /proc/net/arp for AX.25 ibmvnic: Call napi_disable instead of napi_enable in failure path ibmvnic: Initialize completion variables before starting work NET: mkiss: Fix panic net: hns: Fix the device being used for dma mapping during TX sierra_net: Skip validating irrelevant fields for IDLE LSIs sierra_net: Add support for IPv6 and Dual-Stack Link Sense Indications i2c: piix4: Request the SMBUS semaphore inside the mutex i2c: piix4: Fix request_region size powerpc/powernv: Properly set "host-ipi" on IPIs kernel/ucount.c: mark user_header with kmemleak_ignore() net: thunderx: Fix PHY autoneg for SGMII QLM mode ipv6: addrconf: fix generation of new temporary addresses vfio/spapr_tce: Set window when adding additional groups to container ipv6: Fix IPv6 packet loss in scenarios involving roaming + snooping switches ARM: defconfigs: make NF_CT_PROTO_SCTP and NF_CT_PROTO_UDPLITE built-in PM / runtime: Avoid false-positive warnings from might_sleep_if() jump label: pass kbuild_cflags when checking for asm goto support shmem: fix sleeping from atomic context kasan: respect /proc/sys/kernel/traceoff_on_warning log2: make order_base_2() behave correctly on const input value zero ethtool: do not vzalloc(0) on registers dump net: phy: Fix lack of reference count on PHY driver net: phy: Fix PHY module checks and NULL deref in phy_attach_direct() net: fix ndo_features_check/ndo_fix_features comment ordering fscache: Fix dead object requeue fscache: Clear outstanding writes when disabling a cookie FS-Cache: Initialise stores_lock in netfs cookie ipv6: fix flow labels when the traffic class is non-0 drm/nouveau: prevent userspace from deleting client object drm/nouveau/fence/g84-: protect against concurrent access to semaphore buffers net/mlx4_core: Avoid command timeouts during VF driver device shutdown gianfar: synchronize DMA API usage by free_skb_rx_queue w/ gfar_new_page pinctrl: baytrail: Rectify debounce support (part 2) cec: fix wrong last_la determination drm: prevent double-(un)registration for connectors drm: Don't race connector registration pinctrl: berlin-bg4ct: fix the value for "sd1a" of pin SCRD0_CRD_PRES net: adaptec: starfire: add checks for dma mapping errors drm/i915: Check for NULL i915_vma in intel_unpin_fb_obj() net/mlx5: E-Switch, Err when retrieving steering name-space fails net/mlx5: Return EOPNOTSUPP when failing to get steering name-space parisc, parport_gsc: Fixes for printk continuation lines net: phy: micrel: add support for KSZ8795 gtp: add genl family modules alias drm/nouveau: Intercept ACPI_VIDEO_NOTIFY_PROBE drm/nouveau: Rename acpi_work to hpd_work drm/nouveau: Handle fbcon suspend/resume in seperate worker drm/nouveau: Don't enabling polling twice on runtime resume drm/nouveau: Fix drm poll_helper handling drm/ast: Fixed system hanged if disable P2A ravb: unmap descriptors when freeing rings nfs: Fix "Don't increment lock sequence ID after NFS4ERR_MOVED" nvmet-rdma: Fix missing dma sync to nvme data structures r8152: avoid start_xmit to call napi_schedule during autosuspend r8152: check rx after napi is enabled r8152: re-schedule napi for tx r8152: fix rtl8152_post_reset function r8152: avoid start_xmit to schedule napi when napi is disabled net-next: ethernet: mediatek: change the compatible string bnxt_en: Fix bnxt_reset() in the slow path task. bnxt_en: Enhance autoneg support. bnxt_en: Fix RTNL lock usage on bnxt_update_link(). bnxt_en: Fix RTNL lock usage on bnxt_get_port_module_status(). sctp: sctp gso should set feature with NETIF_F_SG when calling skb_segment sctp: sctp_addr_id2transport should verify the addr before looking up assoc usb: musb: Fix external abort on non-linefetch for musb_irq_work() mn10300: fix build error of missing fpu_save() romfs: use different way to generate fsid for BLOCK or MTD frv: add atomic64_add_unless() frv: add missing atomic64 operations proc: add a schedule point in proc_pid_readdir() userfaultfd: fix SIGBUS resulting from false rwsem wakeups kernel/watchdog.c: move hardlockup detector to separate file kernel/watchdog.c: move shared definitions to nmi.h kernel/watchdog: prevent false hardlockup on overloaded system vhost/vsock: handle vhost_vq_init_access() error ARC: smp-boot: Decouple Non masters waiting API from jump to entry point ARCv2: smp-boot: wake_flag polling by non-Masters needs to be uncached tipc: ignore requests when the connection state is not CONNECTED tipc: fix connection refcount error tipc: add subscription refcount to avoid invalid delete tipc: fix nametbl_lock soft lockup at node/link events netfilter: nf_tables: fix set->nelems counting with no NLM_F_EXCL netfilter: nft_log: restrict the log prefix length to 127 RDMA/qedr: Dispatch port active event from qedr_add RDMA/qedr: Fix and simplify memory leak in PD alloc RDMA/qedr: Don't reset QP when queues aren't flushed RDMA/qedr: Don't spam dmesg if QP is in error state RDMA/qedr: Return max inline data in QP query result xtensa: don't use linux IRQ #0 s390/kvm: do not rely on the ILC on kvm host protection fauls drm/i915: Workaround VLV/CHV DSI scanline counter hardware fail drm/i915: Always recompute watermarks when distrust_bios_wm is set, v2. sparc64: make string buffers large enough Linux 4.9.33 Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>	2017-06-27 10:43:04 +02:00
Andrea Arcangeli	dbd9eee1aa	userfaultfd: fix SIGBUS resulting from false rwsem wakeups [ Upstream commit `15a77c6fe4` ] With >=32 CPUs the userfaultfd selftest triggered a graceful but unexpected SIGBUS because VM_FAULT_RETRY was returned by handle_userfault() despite the UFFDIO_COPY wasn't completed. This seems caused by rwsem waking the thread blocked in handle_userfault() and we can't run up_read() before the wait_event sequence is complete. Keeping the wait_even sequence identical to the first one, would require running userfaultfd_must_wait() again to know if the loop should be repeated, and it would also require retaking the rwsem and revalidating the whole vma status. It seems simpler to wait the targeted wakeup so that if false wakeups materialize we still wait for our specific wakeup event, unless of course there are signals or the uffd was released. Debug code collecting the stack trace of the wakeup showed this: $ ./userfaultfd 100 99999 nr_pages: 25600, nr_pages_per_cpu: 800 bounces: 99998, mode: racing ver poll, userfaults: 32 35 90 232 30 138 69 82 34 30 139 40 40 31 20 19 43 13 15 28 27 38 21 43 56 22 1 17 31 8 4 2 bounces: 99997, mode: rnd ver poll, Bus error (core dumped) save_stack_trace+0x2b/0x50 try_to_wake_up+0x2a6/0x580 wake_up_q+0x32/0x70 rwsem_wake+0xe0/0x120 call_rwsem_wake+0x1b/0x30 up_write+0x3b/0x40 vm_mmap_pgoff+0x9c/0xc0 SyS_mmap_pgoff+0x1a9/0x240 SyS_mmap+0x22/0x30 entry_SYSCALL_64_fastpath+0x1f/0xbd 0xffffffffffffffff FAULT_FLAG_ALLOW_RETRY missing 70 CPU: 24 PID: 1054 Comm: userfaultfd Tainted: G W 4.8.0+ #30 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.3-0-ge2fc41e-prebuilt.qemu-project.org 04/01/2014 Call Trace: dump_stack+0xb8/0x112 handle_userfault+0x572/0x650 handle_mm_fault+0x12cb/0x1520 __do_page_fault+0x175/0x500 trace_do_page_fault+0x61/0x270 do_async_page_fault+0x19/0x90 async_page_fault+0x25/0x30 This always happens when the main userfault selftest thread is running clone() while glibc runs either mprotect or mmap (both taking mmap_sem down_write()) to allocate the thread stack of the background threads, while locking/userfault threads already run at full throttle and are susceptible to false wakeups that may cause handle_userfault() to return before than expected (which results in graceful SIGBUS at the next attempt). This was reproduced only with >=32 CPUs because the loop to start the thread where clone() is too quick with fewer CPUs, while with 32 CPUs there's already significant activity on ~32 locking and userfault threads when the last background threads are started with clone(). This >=32 CPUs SMP race condition is likely reproducible only with the selftest because of the much heavier userfault load it generates if compared to real apps. We'll have to allow "one more" VM_FAULT_RETRY for the WP support and a patch floating around that provides it also hidden this problem but in reality only is successfully at hiding the problem. False wakeups could still happen again the second time handle_userfault() is invoked, even if it's a so rare race condition that getting false wakeups twice in a row is impossible to reproduce. This full fix is needed for correctness, the only alternative would be to allow VM_FAULT_RETRY to be returned infinitely. With this fix the WP support can stick to a strict "one more" VM_FAULT_RETRY logic (no need of returning it infinite times to avoid the SIGBUS). Link: http://lkml.kernel.org/r/20170111005535.13832-2-aarcange@redhat.com Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Reported-by: Shubham Kumar Sharma <shubham.kumar.sharma@oracle.com> Tested-by: Mike Kravetz <mike.kravetz@oracle.com> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Michael Rapoport <RAPOPORT@il.ibm.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2017-06-17 06:41:56 +02:00
Colin Cross	3e4578f42f	ANDROID: mm: add a field to store names for private anonymous memory Userspace processes often have multiple allocators that each do anonymous mmaps to get memory. When examining memory usage of individual processes or systems as a whole, it is useful to be able to break down the various heaps that were allocated by each layer and examine their size, RSS, and physical memory usage. This patch adds a user pointer to the shared union in vm_area_struct that points to a null terminated string inside the user process containing a name for the vma. vmas that point to the same address will be merged, but vmas that point to equivalent strings at different addresses will not be merged. Userspace can set the name for a region of memory by calling prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME, start, len, (unsigned long)name); Setting the name to NULL clears it. The names of named anonymous vmas are shown in /proc/pid/maps as [anon:<name>] and in /proc/pid/smaps in a new "Name" field that is only present for named vmas. If the userspace pointer is no longer valid all or part of the name will be replaced with "<fault>". The idea to store a userspace pointer to reduce the complexity within mm (at the expense of the complexity of reading /proc/pid/mem) came from Dave Hansen. This results in no runtime overhead in the mm subsystem other than comparing the anon_name pointers when considering vma merging. The pointer is stored in a union with fieds that are only used on file-backed mappings, so it does not increase memory usage. Includes fix from Jed Davis <jld@mozilla.com> for typo in prctl_set_vma_anon_name, which could attempt to set the name across two vmas at the same time due to a typo, which might corrupt the vma list. Fix it to use tmp instead of end to limit the name setting to a single vma at a time. Change-Id: I9aa7b6b5ef536cd780599ba4e2fba8ceebe8b59f Signed-off-by: Dmitry Shmidt <dimitrysh@google.com>	2017-01-27 13:52:21 -08:00
Kirill A. Shutemov	bae473a423	mm: introduce fault_env The idea borrowed from Peter's patch from patchset on speculative page faults[1]: Instead of passing around the endless list of function arguments, replace the lot with a single structure so we can change context without endless function signature changes. The changes are mostly mechanical with exception of faultaround code: filemap_map_pages() got reworked a bit. This patch is preparation for the next one. [1] http://lkml.kernel.org/r/20141020222841.302891540@infradead.org Link: http://lkml.kernel.org/r/1466021202-61880-9-git-send-email-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-07-26 16:19:19 -07:00
Oleg Nesterov	d2005e3f41	userfaultfd: don't pin the user memory in userfaultfd_file_create() userfaultfd_file_create() increments mm->mm_users; this means that the memory won't be unmapped/freed if mm owner exits/execs, and UFFDIO_COPY after that can populate the orphaned mm more. Change userfaultfd_file_create() and userfaultfd_ctx_put() to use mm->mm_count to pin mm_struct. This means that atomic_inc_not_zero(mm->mm_users) is needed when we are going to actually play with this memory. Except handle_userfault() path doesn't need this, the caller must already have a reference. The patch adds the new trivial helper, mmget_not_zero(), it can have more users. Link: http://lkml.kernel.org/r/20160516172254.GA8595@redhat.com Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-05-20 17:58:30 -07:00
Linus Torvalds	39680f50ae	userfaultfd: don't block on the last VM updates at exit time The exit path will do some final updates to the VM of an exiting process to inform others of the fact that the process is going away. That happens, for example, for robust futex state cleanup, but also if the parent has asked for a TID update when the process exits (we clear the child tid field in user space). However, at the time we do those final VM accesses, we've already stopped accepting signals, so the usual "stop waiting for userfaults on signal" code in fs/userfaultfd.c no longer works, and the process can become an unkillable zombie waiting for something that will never happen. To solve this, just make handle_userfault() abort any user fault handling if we're already in the exit path past the signal handling state being dead (marked by PF_EXITING). This VM special case is pretty ugly, and it is possible that we should look at finalizing signals later (or move the VM final accesses earlier). But in the meantime this is a fairly minimally intrusive fix. Reported-and-tested-by: Dmitry Vyukov <dvyukov@google.com> Acked-by: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-03-02 09:03:18 -08:00
Andrea Arcangeli	ac5be6b47e	userfaultfd: revert "userfaultfd: waitqueue: add nr wake parameter to __wake_up_locked_key" This reverts commit `51360155ec` and adapts fs/userfaultfd.c to use the old version of that function. It didn't look robust to call __wake_up_common with "nr == 1" when we absolutely require wakeall semantics, but we've full control of what we insert in the two waitqueue heads of the blocked userfaults. No exclusive waitqueue risks to be inserted into those two waitqueue heads so we can as well stick to "nr == 1" of the old code and we can rely purely on the fact no waitqueue inserted in one of the two waitqueue heads we must enforce as wakeall, has wait->flags WQ_FLAG_EXCLUSIVE set. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: Dr. David Alan Gilbert <dgilbert@redhat.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Shuah Khan <shuahkh@osg.samsung.com> Cc: Thierry Reding <treding@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2015-09-22 15:09:53 -07:00
Eric Biggers	c03e946fdd	userfaultfd: add missing mmput() in error path This fixes a memleak if anon_inode_getfile() fails in userfaultfd(). Signed-off-by: Eric Biggers <ebiggers3@gmail.com> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2015-09-17 21:16:07 -07:00
Andrea Arcangeli	2c5b7e1be7	userfaultfd: avoid missing wakeups during refile in userfaultfd_read During the refile in userfaultfd_read both waitqueues could look empty to the lockless wake_userfault(). Use a seqcount to prevent this false negative that could leave an userfault blocked. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2015-09-04 16:54:41 -07:00
Andrea Arcangeli	dfa37dc3fc	userfaultfd: allow signals to interrupt a userfault This is only simple to achieve if the userfault is going to return to userland (not to the kernel) because we can avoid returning VM_FAULT_RETRY despite we temporarily released the mmap_sem. The fault would just be retried by userland then. This is safe at least on x86 and powerpc (the two archs with the syscall implemented so far). Hint to verify for which archs this is safe: after handle_mm_fault returns, no access to data structures protected by the mmap_sem must be done by the fault code in arch/*/mm/fault.c until up_read(&mm->mmap_sem) is called. This has two main benefits: signals can run with lower latency in production (signals aren't blocked by userfaults and userfaults are immediately repeated after signal processing) and gdb can then trivially debug the threads blocked in this kind of userfaults coming directly from userland. On a side note: while gdb has a need to get signal processed, coredumps always worked perfectly with userfaults, no matter if the userfault is triggered by GUP a kernel copy_user or directly from userland. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2015-09-04 16:54:41 -07:00
Andrea Arcangeli	e6485a47b7	userfaultfd: require UFFDIO_API before other ioctls UFFDIO_API was already forced before read/poll could work. This makes the code more strict to force it also for all other ioctls. All users would already have been required to call UFFDIO_API before invoking other ioctls but this makes it more explicit. This will ensure we can change all ioctls (all but UFFDIO_API/struct uffdio_api) with a bump of uffdio_api.api. There's no actual plan or need to change the API or the ioctl, the current API already should cover fine even the non cooperative usage, but this is just for the longer term future just in case. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2015-09-04 16:54:41 -07:00
Andrea Arcangeli	ad465cae96	userfaultfd: UFFDIO_COPY and UFFDIO_ZEROPAGE These two ioctl allows to either atomically copy or to map zeropages into the virtual address space. This is used by the thread that opened the userfaultfd to resolve the userfaults. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Pavel Emelyanov <xemul@parallels.com> Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com> Cc: zhang.zhanghailiang@huawei.com Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Andres Lagar-Cavilla <andreslc@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Hugh Dickins <hughd@google.com> Cc: Peter Feiner <pfeiner@google.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2015-09-04 16:54:41 -07:00
Andrea Arcangeli	8d2afd96c2	userfaultfd: solve the race between UFFDIO_COPY\|ZEROPAGE and read Solve in-kernel the race between UFFDIO_COPY\|ZEROPAGE and userfaultfd_read if they are run on different threads simultaneously. Until now qemu solved the race in userland: the race was explicitly and intentionally left for userland to solve. However we can also solve it in kernel. Requiring all users to solve this race if they use two threads (one for the background transfer and one for the userfault reads) isn't very attractive from an API prospective, furthermore this allows to remove a whole bunch of mutex and bitmap code from qemu, making it faster. The cost of __get_user_pages_fast should be insignificant considering it scales perfectly and the pagetables are already hot in the CPU cache, compared to the overhead in userland to maintain those structures. Applying this patch is backwards compatible with respect to the userfaultfd userland API, however reverting this change wouldn't be backwards compatible anymore. Without this patch qemu in the background transfer thread, has to read the old state, and do UFFDIO_WAKE if old_state is missing but it become REQUESTED by the time it tries to set it to RECEIVED (signaling the other side received an userfault). vcpu background_thr userfault_thr ----- ----- ----- vcpu0 handle_mm_fault() postcopy_place_page read old_state -> MISSING UFFDIO_COPY 0x7fb76a139000 (no wakeup, still pending) vcpu0 fault at 0x7fb76a139000 enters handle_userfault poll() is kicked poll() -> POLLIN read() -> 0x7fb76a139000 postcopy_pmi_change_state(MISSING, REQUESTED) -> REQUESTED tmp_state = postcopy_pmi_change_state(old_state, RECEIVED) -> REQUESTED /* check that no userfault raced with UFFDIO_COPY */ if (old_state == MISSING && tmp_state == REQUESTED) UFFDIO_WAKE from background thread And a second case where a UFFDIO_WAKE would be needed is in the userfault thread: vcpu background_thr userfault_thr ----- ----- ----- vcpu0 handle_mm_fault() postcopy_place_page read old_state -> MISSING UFFDIO_COPY 0x7fb76a139000 (no wakeup, still pending) tmp_state = postcopy_pmi_change_state(old_state, RECEIVED) -> RECEIVED vcpu0 fault at 0x7fb76a139000 enters handle_userfault poll() is kicked poll() -> POLLIN read() -> 0x7fb76a139000 if (postcopy_pmi_change_state(MISSING, REQUESTED) == RECEIVED) UFFDIO_WAKE from userfault thread This patch removes the need of both UFFDIO_WAKE and of the associated per-page tristate as well. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Pavel Emelyanov <xemul@parallels.com> Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com> Cc: zhang.zhanghailiang@huawei.com Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Andres Lagar-Cavilla <andreslc@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Hugh Dickins <hughd@google.com> Cc: Peter Feiner <pfeiner@google.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2015-09-04 16:54:41 -07:00
Andrea Arcangeli	3004ec9cab	userfaultfd: allocate the userfaultfd_ctx cacheline aligned Use proper slab to guarantee alignment. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Pavel Emelyanov <xemul@parallels.com> Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com> Cc: zhang.zhanghailiang@huawei.com Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Andres Lagar-Cavilla <andreslc@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Hugh Dickins <hughd@google.com> Cc: Peter Feiner <pfeiner@google.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2015-09-04 16:54:41 -07:00
Andrea Arcangeli	15b726ef04	userfaultfd: optimize read() and poll() to be O(1) This makes read O(1) and poll that was already O(1) becomes lockless. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Pavel Emelyanov <xemul@parallels.com> Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com> Cc: zhang.zhanghailiang@huawei.com Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Andres Lagar-Cavilla <andreslc@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Hugh Dickins <hughd@google.com> Cc: Peter Feiner <pfeiner@google.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2015-09-04 16:54:41 -07:00
Andrea Arcangeli	ba85c702e4	userfaultfd: wake pending userfaults This is an optimization but it's a userland visible one and it affects the API. The downside of this optimization is that if you call poll() and you get POLLIN, read(ufd) may still return -EAGAIN. The blocked userfault may be waken by a different thread, before read(ufd) comes around. This in short means that poll() isn't really usable if the userfaultfd is opened in blocking mode. userfaults won't wait in "pending" state to be read anymore and any UFFDIO_WAKE or similar operations that has the objective of waking userfaults after their resolution, will wake all blocked userfaults for the resolved range, including those that haven't been read() by userland yet. The behavior of poll() becomes not standard, but this obviates the need of "spurious" UFFDIO_WAKE and it lets the userland threads to restart immediately without requiring an UFFDIO_WAKE. This is even more significant in case of repeated faults on the same address from multiple threads. This optimization is justified by the measurement that the number of spurious UFFDIO_WAKE accounts for 5% and 10% of the total userfaults for heavy workloads, so it's worth optimizing those away. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Pavel Emelyanov <xemul@parallels.com> Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com> Cc: zhang.zhanghailiang@huawei.com Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Andres Lagar-Cavilla <andreslc@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Hugh Dickins <hughd@google.com> Cc: Peter Feiner <pfeiner@google.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2015-09-04 16:54:41 -07:00
Andrea Arcangeli	a9b85f9415	userfaultfd: change the read API to return a uffd_msg I had requests to return the full address (not the page aligned one) to userland. It's not entirely clear how the page offset could be relevant because userfaults aren't like SIGBUS that can sigjump to a different place and it actually skip resolving the fault depending on a page offset. There's currently no real way to skip the fault especially because after a UFFDIO_COPY\|ZEROPAGE, the fault is optimized to be retried within the kernel without having to return to userland first (not even self modifying code replacing the .text that touched the faulting address would prevent the fault to be repeated). Userland cannot skip repeating the fault even more so if the fault was triggered by a KVM secondary page fault or any get_user_pages or any copy-user inside some syscall which will return to kernel code. The second time FAULT_FLAG_RETRY_NOWAIT won't be set leading to a SIGBUS being raised because the userfault can't wait if it cannot release the mmap_map first (and FAULT_FLAG_RETRY_NOWAIT is required for that). Still returning userland a proper structure during the read() on the uffd, can allow to use the current UFFD_API for the future non-cooperative extensions too and it looks cleaner as well. Once we get additional fields there's no point to return the fault address page aligned anymore to reuse the bits below PAGE_SHIFT. The only downside is that the read() syscall will read 32bytes instead of 8bytes but that's not going to be measurable overhead. The total number of new events that can be extended or of new future bits for already shipped events, is limited to 64 by the features field of the uffdio_api structure. If more will be needed a bump of UFFD_API will be required. [akpm@linux-foundation.org: use __packed] Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Pavel Emelyanov <xemul@parallels.com> Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com> Cc: zhang.zhanghailiang@huawei.com Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Andres Lagar-Cavilla <andreslc@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Hugh Dickins <hughd@google.com> Cc: Peter Feiner <pfeiner@google.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2015-09-04 16:54:41 -07:00
Pavel Emelyanov	3f602d2724	userfaultfd: Rename uffd_api.bits into .features This is (seems to be) the minimal thing that is required to unblock standard uffd usage from the non-cooperative one. Now more bits can be added to the features field indicating e.g. UFFD_FEATURE_FORK and others needed for the latter use-case. Signed-off-by: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com> Cc: zhang.zhanghailiang@huawei.com Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Andres Lagar-Cavilla <andreslc@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Hugh Dickins <hughd@google.com> Cc: Peter Feiner <pfeiner@google.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2015-09-04 16:54:41 -07:00
Andrea Arcangeli	86039bd3b4	userfaultfd: add new syscall to provide memory externalization Once an userfaultfd has been created and certain region of the process virtual address space have been registered into it, the thread responsible for doing the memory externalization can manage the page faults in userland by talking to the kernel using the userfaultfd protocol. poll() can be used to know when there are new pending userfaults to be read (POLLIN). Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Pavel Emelyanov <xemul@parallels.com> Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com> Cc: zhang.zhanghailiang@huawei.com Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Andres Lagar-Cavilla <andreslc@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Hugh Dickins <hughd@google.com> Cc: Peter Feiner <pfeiner@google.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2015-09-04 16:54:41 -07:00

23 Commits