Commit Graph

23 Commits

Author SHA1 Message Date
Oleg Nesterov
93c892f7ca userfaultfd_release: always remove uffd flags and clear vm_userfaultfd_ctx
commit 46d0b24c5e upstream.

userfaultfd_release() should clear vm_flags/vm_userfaultfd_ctx even if
mm->core_state != NULL.

Otherwise a page fault can see userfaultfd_missing() == T and use an
already freed userfaultfd_ctx.

Link: http://lkml.kernel.org/r/20190820160237.GB4983@redhat.com
Fixes: 04f5866e41 ("coredump: fix race condition between mmget_not_zero()/get_task_mm() and core dumping")
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reported-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Tested-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Jann Horn <jannh@google.com>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-05-15 14:21:05 +09:00
Andrea Arcangeli
4da79de827 coredump: fix race condition between mmget_not_zero()/get_task_mm() and core dumping
commit 04f5866e41 upstream.

The core dumping code has always run without holding the mmap_sem for
writing, despite that is the only way to ensure that the entire vma
layout will not change from under it.  Only using some signal
serialization on the processes belonging to the mm is not nearly enough.
This was pointed out earlier.  For example in Hugh's post from Jul 2017:

  https://lkml.kernel.org/r/alpine.LSU.2.11.1707191716030.2055@eggly.anvils

  "Not strictly relevant here, but a related note: I was very surprised
   to discover, only quite recently, how handle_mm_fault() may be called
   without down_read(mmap_sem) - when core dumping. That seems a
   misguided optimization to me, which would also be nice to correct"

In particular because the growsdown and growsup can move the
vm_start/vm_end the various loops the core dump does around the vma will
not be consistent if page faults can happen concurrently.

Pretty much all users calling mmget_not_zero()/get_task_mm() and then
taking the mmap_sem had the potential to introduce unexpected side
effects in the core dumping code.

Adding mmap_sem for writing around the ->core_dump invocation is a
viable long term fix, but it requires removing all copy user and page
faults and to replace them with get_dump_page() for all binary formats
which is not suitable as a short term fix.

For the time being this solution manually covers the places that can
confuse the core dump either by altering the vma layout or the vma flags
while it runs.  Once ->core_dump runs under mmap_sem for writing the
function mmget_still_valid() can be dropped.

Allowing mmap_sem protected sections to run in parallel with the
coredump provides some minor parallelism advantage to the swapoff code
(which seems to be safe enough by never mangling any vma field and can
keep doing swapins in parallel to the core dumping) and to some other
corner case.

In order to facilitate the backporting I added "Fixes: 86039bd3b4e6"
however the side effect of this same race condition in /proc/pid/mem
should be reproducible since before 2.6.12-rc2 so I couldn't add any
other "Fixes:" because there's no hash beyond the git genesis commit.

Because find_extend_vma() is the only location outside of the process
context that could modify the "mm" structures under mmap_sem for
reading, by adding the mmget_still_valid() check to it, all other cases
that take the mmap_sem for reading don't need the new check after
mmget_not_zero()/get_task_mm().  The expand_stack() in page fault
context also doesn't need the new check, because all tasks under core
dumping are frozen.

Link: http://lkml.kernel.org/r/20190325224949.11068-1-aarcange@redhat.com
Fixes: 86039bd3b4 ("userfaultfd: add new syscall to provide memory externalization")
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reported-by: Jann Horn <jannh@google.com>
Suggested-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Jann Horn <jannh@google.com>
Acked-by: Jason Gunthorpe <jgg@mellanox.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[akaher@vmware.com: stable 4.9 backport
-  handle binder_update_page_range - mhocko@suse.com]
Signed-off-by: Ajay Kaher <akaher@vmware.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-05-15 14:09:41 +09:00
Greg Kroah-Hartman
319c8e1bc7 Merge 4.9.71 into android-4.9
Changes in 4.9.71
	mfd: fsl-imx25: Clean up irq settings during removal
	crypto: rsa - fix buffer overread when stripping leading zeroes
	crypto: hmac - require that the underlying hash algorithm is unkeyed
	crypto: salsa20 - fix blkcipher_walk API usage
	autofs: fix careless error in recent commit
	tracing: Allocate mask_str buffer dynamically
	USB: uas and storage: Add US_FL_BROKEN_FUA for another JMicron JMS567 ID
	USB: core: prevent malicious bNumInterfaces overflow
	usbip: fix stub_rx: get_pipe() to validate endpoint number
	usb: add helper to extract bits 12:11 of wMaxPacketSize
	usbip: fix stub_rx: harden CMD_SUBMIT path to handle malicious input
	usbip: fix stub_send_ret_submit() vulnerability to null transfer_buffer
	ceph: drop negative child dentries before try pruning inode's alias
	usb: xhci: fix TDS for MTK xHCI1.1
	Bluetooth: btusb: driver to enable the usb-wakeup feature
	xhci: Don't add a virt_dev to the devs array before it's fully allocated
	nfs: don't wait on commit in nfs_commit_inode() if there were no commit requests
	sched/rt: Do not pull from current CPU if only one CPU to pull
	eeprom: at24: change nvmem stride to 1
	dmaengine: dmatest: move callback wait queue to thread context
	ext4: fix fdatasync(2) after fallocate(2) operation
	ext4: fix crash when a directory's i_size is too small
	mac80211: Fix addition of mesh configuration element
	usb: phy: isp1301: Add OF device ID table
	KVM: nVMX: do not warn when MSR bitmap address is not backed
	usb: xhci-mtk: check hcc_params after adding primary hcd
	md-cluster: free md_cluster_info if node leave cluster
	userfaultfd: shmem: __do_fault requires VM_FAULT_NOPAGE
	userfaultfd: selftest: vm: allow to build in vm/ directory
	net: initialize msg.msg_flags in recvfrom
	bnxt_en: Ignore 0 value in autoneg supported speed from firmware.
	net: bcmgenet: correct the RBUF_OVFL_CNT and RBUF_ERR_CNT MIB values
	net: bcmgenet: correct MIB access of UniMAC RUNT counters
	net: bcmgenet: reserved phy revisions must be checked first
	net: bcmgenet: power down internal phy if open or resume fails
	net: bcmgenet: synchronize irq0 status between the isr and task
	net: bcmgenet: Power up the internal PHY before probing the MII
	rxrpc: Wake up the transmitter if Rx window size increases on the peer
	net/mlx5: Fix create autogroup prev initializer
	net/mlx5: Don't save PCI state when PCI error is detected
	iommu/io-pgtable-arm-v7s: Check for leaf entry before dereferencing it
	drm/amdgpu: fix parser init error path to avoid crash in parser fini
	NFSD: fix nfsd_minorversion(.., NFSD_AVAIL)
	NFSD: fix nfsd_reset_versions for NFSv4.
	Input: i8042 - add TUXEDO BU1406 (N24_25BU) to the nomux list
	drm/omap: fix dmabuf mmap for dma_alloc'ed buffers
	netfilter: bridge: honor frag_max_size when refragmenting
	ASoC: rsnd: fix sound route path when using SRC6/SRC9
	blk-mq: Fix tagset reinit in the presence of cpu hot-unplug
	writeback: fix memory leak in wb_queue_work()
	net: wimax/i2400m: fix NULL-deref at probe
	dmaengine: Fix array index out of bounds warning in __get_unmap_pool()
	irqchip/mvebu-odmi: Select GENERIC_MSI_IRQ_DOMAIN
	net: Resend IGMP memberships upon peer notification.
	mlxsw: reg: Fix SPVM max record count
	mlxsw: reg: Fix SPVMLR max record count
	qed: Align CIDs according to DORQ requirement
	qed: Fix mapping leak on LL2 rx flow
	qed: Fix interrupt flags on Rx LL2
	drm: amd: remove broken include path
	intel_th: pci: Add Gemini Lake support
	openrisc: fix issue handling 8 byte get_user calls
	ASoC: rcar: clear DE bit only in PDMACHCR when it stops
	scsi: hpsa: update check for logical volume status
	scsi: hpsa: limit outstanding rescans
	scsi: hpsa: do not timeout reset operations
	fjes: Fix wrong netdevice feature flags
	drm/radeon/si: add dpm quirk for Oland
	Drivers: hv: util: move waiting for release to hv_utils_transport itself
	iwlwifi: mvm: cleanup pending frames in DQA mode
	sched/deadline: Add missing update_rq_clock() in dl_task_timer()
	sched/deadline: Make sure the replenishment timer fires in the next period
	sched/deadline: Throttle a constrained deadline task activated after the deadline
	sched/deadline: Use deadline instead of period when calculating overflow
	mmc: mediatek: Fixed bug where clock frequency could be set wrong
	drm/radeon: reinstate oland workaround for sclk
	afs: Fix missing put_page()
	afs: Populate group ID from vnode status
	afs: Adjust mode bits processing
	afs: Deal with an empty callback array
	afs: Flush outstanding writes when an fd is closed
	afs: Migrate vlocation fields to 64-bit
	afs: Prevent callback expiry timer overflow
	afs: Fix the maths in afs_fs_store_data()
	afs: Invalid op ID should abort with RXGEN_OPCODE
	afs: Better abort and net error handling
	afs: Populate and use client modification time
	afs: Fix page leak in afs_write_begin()
	afs: Fix afs_kill_pages()
	afs: Fix abort on signal while waiting for call completion
	nvme-loop: fix a possible use-after-free when destroying the admin queue
	nvmet: confirm sq percpu has scheduled and switched to atomic
	nvmet-rdma: Fix a possible uninitialized variable dereference
	net/mlx4_core: Avoid delays during VF driver device shutdown
	net: mpls: Fix nexthop alive tracking on down events
	rxrpc: Ignore BUSY packets on old calls
	tty: don't panic on OOM in tty_set_ldisc()
	tty: fix data race in tty_ldisc_ref_wait()
	perf symbols: Fix symbols__fixup_end heuristic for corner cases
	efi/esrt: Cleanup bad memory map log messages
	NFSv4.1 respect server's max size in CREATE_SESSION
	btrfs: add missing memset while reading compressed inline extents
	target: Use system workqueue for ALUA transitions
	target: fix ALUA transition timeout handling
	target: fix race during implicit transition work flushes
	Revert "x86/acpi: Set persistent cpuid <-> nodeid mapping when booting"
	HID: cp2112: fix broken gpio_direction_input callback
	sfc: don't warn on successful change of MAC
	fbdev: controlfb: Add missing modes to fix out of bounds access
	video: udlfb: Fix read EDID timeout
	video: fbdev: au1200fb: Release some resources if a memory allocation fails
	video: fbdev: au1200fb: Return an error code if a memory allocation fails
	rtc: pcf8563: fix output clock rate
	ASoC: Intel: Skylake: Fix uuid_module memory leak in failure case
	dmaengine: ti-dma-crossbar: Correct am335x/am43xx mux value type
	PCI/PME: Handle invalid data when reading Root Status
	powerpc/powernv/cpufreq: Fix the frequency read by /proc/cpuinfo
	PCI: Do not allocate more buses than available in parent
	iommu/mediatek: Fix driver name
	netfilter: ipvs: Fix inappropriate output of procfs
	powerpc/opal: Fix EBUSY bug in acquiring tokens
	powerpc/ipic: Fix status get and status clear
	platform/x86: intel_punit_ipc: Fix resource ioremap warning
	target/iscsi: Fix a race condition in iscsit_add_reject_from_cmd()
	iscsi-target: fix memory leak in lio_target_tiqn_addtpg()
	target:fix condition return in core_pr_dump_initiator_port()
	target/file: Do not return error for UNMAP if length is zero
	badblocks: fix wrong return value in badblocks_set if badblocks are disabled
	iommu/amd: Limit the IOVA page range to the specified addresses
	xfs: truncate pagecache before writeback in xfs_setattr_size()
	arm-ccn: perf: Prevent module unload while PMU is in use
	crypto: tcrypt - fix buffer lengths in test_aead_speed()
	mm: Handle 0 flags in _calc_vm_trans() macro
	clk: mediatek: add the option for determining PLL source clock
	clk: imx6: refine hdmi_isfr's parent to make HDMI work on i.MX6 SoCs w/o VPU
	clk: hi6220: mark clock cs_atb_syspll as critical
	clk: tegra: Fix cclk_lp divisor register
	ppp: Destroy the mutex when cleanup
	ASoC: rsnd: rsnd_ssi_run_mods() needs to care ssi_parent_mod
	thermal/drivers/step_wise: Fix temperature regulation misbehavior
	scsi: scsi_debug: write_same: fix error report
	GFS2: Take inode off order_write list when setting jdata flag
	bcache: explicitly destroy mutex while exiting
	bcache: fix wrong cache_misses statistics
	Ib/hfi1: Return actual operational VLs in port info query
	arm64: prevent regressions in compressed kernel image size when upgrading to binutils 2.27
	btrfs: tests: Fix a memory leak in error handling path in 'run_test()'
	platform/x86: hp_accel: Add quirk for HP ProBook 440 G4
	nvme: use kref_get_unless_zero in nvme_find_get_ns
	l2tp: cleanup l2tp_tunnel_delete calls
	xfs: fix log block underflow during recovery cycle verification
	xfs: fix incorrect extent state in xfs_bmap_add_extent_unwritten_real
	RDMA/cxgb4: Declare stag as __be32
	PCI: Detach driver before procfs & sysfs teardown on device remove
	scsi: hpsa: cleanup sas_phy structures in sysfs when unloading
	scsi: hpsa: destroy sas transport properties before scsi_host
	powerpc/perf/hv-24x7: Fix incorrect comparison in memord
	soc: mediatek: pwrap: fix compiler errors
	tty fix oops when rmmod 8250
	usb: musb: da8xx: fix babble condition handling
	pinctrl: adi2: Fix Kconfig build problem
	raid5: Set R5_Expanded on parity devices as well as data.
	scsi: scsi_devinfo: Add REPORTLUN2 to EMC SYMMETRIX blacklist entry
	IB/core: Fix calculation of maximum RoCE MTU
	vt6655: Fix a possible sleep-in-atomic bug in vt6655_suspend
	rtl8188eu: Fix a possible sleep-in-atomic bug in rtw_createbss_cmd
	rtl8188eu: Fix a possible sleep-in-atomic bug in rtw_disassoc_cmd
	scsi: sd: change manage_start_stop to bool in sysfs interface
	scsi: sd: change allow_restart to bool in sysfs interface
	scsi: bfa: integer overflow in debugfs
	udf: Avoid overflow when session starts at large offset
	macvlan: Only deliver one copy of the frame to the macvlan interface
	RDMA/cma: Avoid triggering undefined behavior
	IB/ipoib: Grab rtnl lock on heavy flush when calling ndo_open/stop
	icmp: don't fail on fragment reassembly time exceeded
	ath9k: fix tx99 potential info leak
	Linux 4.9.71

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2017-12-20 10:51:15 +01:00
Andrea Arcangeli
275314e90c userfaultfd: shmem: __do_fault requires VM_FAULT_NOPAGE
[ Upstream commit 6bbc4a4144 ]

__do_fault assumes vmf->page has been initialized and is valid if
VM_FAULT_NOPAGE is not returned by vma->vm_ops->fault(vma, vmf).

handle_userfault() in turn should return VM_FAULT_NOPAGE if it doesn't
return VM_FAULT_SIGBUS or VM_FAULT_RETRY (the other two possibilities).

This VM_FAULT_NOPAGE case is only invoked when signal are pending and it
didn't matter for anonymous memory before.  It only started to matter
since shmem was introduced.  hugetlbfs also takes a different path and
doesn't exercise __do_fault.

Link: http://lkml.kernel.org/r/20170228154201.GH5816@redhat.com
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-12-20 10:07:18 +01:00
Greg Kroah-Hartman
0d21cf1656 Merge 4.9.33 into android-4.9
Changes in 4.9.33
	PCI/PM: Add needs_resume flag to avoid suspend complete optimization
	drm/i915: Prevent the system suspend complete optimization
	partitions/msdos: FreeBSD UFS2 file systems are not recognized
	netfilter: nf_conntrack_sip: fix wrong memory initialisation
	ibmvnic: Fix endian errors in error reporting output
	ibmvnic: Fix endian error when requesting device capabilities
	net: xilinx_emaclite: fix freezes due to unordered I/O
	net: xilinx_emaclite: fix receive buffer overflow
	tcp: tcp_probe: use spin_lock_bh()
	ipv6: Handle IPv4-mapped src to in6addr_any dst.
	ipv6: Inhibit IPv4-mapped src address on the wire.
	tipc: Fix tipc_sk_reinit race conditions
	gfs2: Use rhashtable walk interface in glock_hash_walk
	NET: Fix /proc/net/arp for AX.25
	ibmvnic: Call napi_disable instead of napi_enable in failure path
	ibmvnic: Initialize completion variables before starting work
	NET: mkiss: Fix panic
	net: hns: Fix the device being used for dma mapping during TX
	sierra_net: Skip validating irrelevant fields for IDLE LSIs
	sierra_net: Add support for IPv6 and Dual-Stack Link Sense Indications
	i2c: piix4: Request the SMBUS semaphore inside the mutex
	i2c: piix4: Fix request_region size
	powerpc/powernv: Properly set "host-ipi" on IPIs
	kernel/ucount.c: mark user_header with kmemleak_ignore()
	net: thunderx: Fix PHY autoneg for SGMII QLM mode
	ipv6: addrconf: fix generation of new temporary addresses
	vfio/spapr_tce: Set window when adding additional groups to container
	ipv6: Fix IPv6 packet loss in scenarios involving roaming + snooping switches
	ARM: defconfigs: make NF_CT_PROTO_SCTP and NF_CT_PROTO_UDPLITE built-in
	PM / runtime: Avoid false-positive warnings from might_sleep_if()
	jump label: pass kbuild_cflags when checking for asm goto support
	shmem: fix sleeping from atomic context
	kasan: respect /proc/sys/kernel/traceoff_on_warning
	log2: make order_base_2() behave correctly on const input value zero
	ethtool: do not vzalloc(0) on registers dump
	net: phy: Fix lack of reference count on PHY driver
	net: phy: Fix PHY module checks and NULL deref in phy_attach_direct()
	net: fix ndo_features_check/ndo_fix_features comment ordering
	fscache: Fix dead object requeue
	fscache: Clear outstanding writes when disabling a cookie
	FS-Cache: Initialise stores_lock in netfs cookie
	ipv6: fix flow labels when the traffic class is non-0
	drm/nouveau: prevent userspace from deleting client object
	drm/nouveau/fence/g84-: protect against concurrent access to semaphore buffers
	net/mlx4_core: Avoid command timeouts during VF driver device shutdown
	gianfar: synchronize DMA API usage by free_skb_rx_queue w/ gfar_new_page
	pinctrl: baytrail: Rectify debounce support (part 2)
	cec: fix wrong last_la determination
	drm: prevent double-(un)registration for connectors
	drm: Don't race connector registration
	pinctrl: berlin-bg4ct: fix the value for "sd1a" of pin SCRD0_CRD_PRES
	net: adaptec: starfire: add checks for dma mapping errors
	drm/i915: Check for NULL i915_vma in intel_unpin_fb_obj()
	net/mlx5: E-Switch, Err when retrieving steering name-space fails
	net/mlx5: Return EOPNOTSUPP when failing to get steering name-space
	parisc, parport_gsc: Fixes for printk continuation lines
	net: phy: micrel: add support for KSZ8795
	gtp: add genl family modules alias
	drm/nouveau: Intercept ACPI_VIDEO_NOTIFY_PROBE
	drm/nouveau: Rename acpi_work to hpd_work
	drm/nouveau: Handle fbcon suspend/resume in seperate worker
	drm/nouveau: Don't enabling polling twice on runtime resume
	drm/nouveau: Fix drm poll_helper handling
	drm/ast: Fixed system hanged if disable P2A
	ravb: unmap descriptors when freeing rings
	nfs: Fix "Don't increment lock sequence ID after NFS4ERR_MOVED"
	nvmet-rdma: Fix missing dma sync to nvme data structures
	r8152: avoid start_xmit to call napi_schedule during autosuspend
	r8152: check rx after napi is enabled
	r8152: re-schedule napi for tx
	r8152: fix rtl8152_post_reset function
	r8152: avoid start_xmit to schedule napi when napi is disabled
	net-next: ethernet: mediatek: change the compatible string
	bnxt_en: Fix bnxt_reset() in the slow path task.
	bnxt_en: Enhance autoneg support.
	bnxt_en: Fix RTNL lock usage on bnxt_update_link().
	bnxt_en: Fix RTNL lock usage on bnxt_get_port_module_status().
	sctp: sctp gso should set feature with NETIF_F_SG when calling skb_segment
	sctp: sctp_addr_id2transport should verify the addr before looking up assoc
	usb: musb: Fix external abort on non-linefetch for musb_irq_work()
	mn10300: fix build error of missing fpu_save()
	romfs: use different way to generate fsid for BLOCK or MTD
	frv: add atomic64_add_unless()
	frv: add missing atomic64 operations
	proc: add a schedule point in proc_pid_readdir()
	userfaultfd: fix SIGBUS resulting from false rwsem wakeups
	kernel/watchdog.c: move hardlockup detector to separate file
	kernel/watchdog.c: move shared definitions to nmi.h
	kernel/watchdog: prevent false hardlockup on overloaded system
	vhost/vsock: handle vhost_vq_init_access() error
	ARC: smp-boot: Decouple Non masters waiting API from jump to entry point
	ARCv2: smp-boot: wake_flag polling by non-Masters needs to be uncached
	tipc: ignore requests when the connection state is not CONNECTED
	tipc: fix connection refcount error
	tipc: add subscription refcount to avoid invalid delete
	tipc: fix nametbl_lock soft lockup at node/link events
	netfilter: nf_tables: fix set->nelems counting with no NLM_F_EXCL
	netfilter: nft_log: restrict the log prefix length to 127
	RDMA/qedr: Dispatch port active event from qedr_add
	RDMA/qedr: Fix and simplify memory leak in PD alloc
	RDMA/qedr: Don't reset QP when queues aren't flushed
	RDMA/qedr: Don't spam dmesg if QP is in error state
	RDMA/qedr: Return max inline data in QP query result
	xtensa: don't use linux IRQ #0
	s390/kvm: do not rely on the ILC on kvm host protection fauls
	drm/i915: Workaround VLV/CHV DSI scanline counter hardware fail
	drm/i915: Always recompute watermarks when distrust_bios_wm is set, v2.
	sparc64: make string buffers large enough
	Linux 4.9.33

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2017-06-27 10:43:04 +02:00
Andrea Arcangeli
dbd9eee1aa userfaultfd: fix SIGBUS resulting from false rwsem wakeups
[ Upstream commit 15a77c6fe4 ]

With >=32 CPUs the userfaultfd selftest triggered a graceful but
unexpected SIGBUS because VM_FAULT_RETRY was returned by
handle_userfault() despite the UFFDIO_COPY wasn't completed.

This seems caused by rwsem waking the thread blocked in
handle_userfault() and we can't run up_read() before the wait_event
sequence is complete.

Keeping the wait_even sequence identical to the first one, would require
running userfaultfd_must_wait() again to know if the loop should be
repeated, and it would also require retaking the rwsem and revalidating
the whole vma status.

It seems simpler to wait the targeted wakeup so that if false wakeups
materialize we still wait for our specific wakeup event, unless of
course there are signals or the uffd was released.

Debug code collecting the stack trace of the wakeup showed this:

  $ ./userfaultfd 100 99999
  nr_pages: 25600, nr_pages_per_cpu: 800
  bounces: 99998, mode: racing ver poll, userfaults: 32 35 90 232 30 138 69 82 34 30 139 40 40 31 20 19 43 13 15 28 27 38 21 43 56 22 1 17 31 8 4 2
  bounces: 99997, mode: rnd ver poll, Bus error (core dumped)

    save_stack_trace+0x2b/0x50
    try_to_wake_up+0x2a6/0x580
    wake_up_q+0x32/0x70
    rwsem_wake+0xe0/0x120
    call_rwsem_wake+0x1b/0x30
    up_write+0x3b/0x40
    vm_mmap_pgoff+0x9c/0xc0
    SyS_mmap_pgoff+0x1a9/0x240
    SyS_mmap+0x22/0x30
    entry_SYSCALL_64_fastpath+0x1f/0xbd
    0xffffffffffffffff
    FAULT_FLAG_ALLOW_RETRY missing 70
  CPU: 24 PID: 1054 Comm: userfaultfd Tainted: G        W       4.8.0+ #30
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.3-0-ge2fc41e-prebuilt.qemu-project.org 04/01/2014
  Call Trace:
    dump_stack+0xb8/0x112
    handle_userfault+0x572/0x650
    handle_mm_fault+0x12cb/0x1520
    __do_page_fault+0x175/0x500
    trace_do_page_fault+0x61/0x270
    do_async_page_fault+0x19/0x90
    async_page_fault+0x25/0x30

This always happens when the main userfault selftest thread is running
clone() while glibc runs either mprotect or mmap (both taking mmap_sem
down_write()) to allocate the thread stack of the background threads,
while locking/userfault threads already run at full throttle and are
susceptible to false wakeups that may cause handle_userfault() to return
before than expected (which results in graceful SIGBUS at the next
attempt).

This was reproduced only with >=32 CPUs because the loop to start the
thread where clone() is too quick with fewer CPUs, while with 32 CPUs
there's already significant activity on ~32 locking and userfault
threads when the last background threads are started with clone().

This >=32 CPUs SMP race condition is likely reproducible only with the
selftest because of the much heavier userfault load it generates if
compared to real apps.

We'll have to allow "one more" VM_FAULT_RETRY for the WP support and a
patch floating around that provides it also hidden this problem but in
reality only is successfully at hiding the problem.

False wakeups could still happen again the second time
handle_userfault() is invoked, even if it's a so rare race condition
that getting false wakeups twice in a row is impossible to reproduce.
This full fix is needed for correctness, the only alternative would be
to allow VM_FAULT_RETRY to be returned infinitely.  With this fix the WP
support can stick to a strict "one more" VM_FAULT_RETRY logic (no need
of returning it infinite times to avoid the SIGBUS).

Link: http://lkml.kernel.org/r/20170111005535.13832-2-aarcange@redhat.com
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reported-by: Shubham Kumar Sharma <shubham.kumar.sharma@oracle.com>
Tested-by: Mike Kravetz <mike.kravetz@oracle.com>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Michael Rapoport <RAPOPORT@il.ibm.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-17 06:41:56 +02:00
Colin Cross
3e4578f42f ANDROID: mm: add a field to store names for private anonymous memory
Userspace processes often have multiple allocators that each do
anonymous mmaps to get memory.  When examining memory usage of
individual processes or systems as a whole, it is useful to be
able to break down the various heaps that were allocated by
each layer and examine their size, RSS, and physical memory
usage.

This patch adds a user pointer to the shared union in
vm_area_struct that points to a null terminated string inside
the user process containing a name for the vma.  vmas that
point to the same address will be merged, but vmas that
point to equivalent strings at different addresses will
not be merged.

Userspace can set the name for a region of memory by calling
prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME, start, len, (unsigned long)name);
Setting the name to NULL clears it.

The names of named anonymous vmas are shown in /proc/pid/maps
as [anon:<name>] and in /proc/pid/smaps in a new "Name" field
that is only present for named vmas.  If the userspace pointer
is no longer valid all or part of the name will be replaced
with "<fault>".

The idea to store a userspace pointer to reduce the complexity
within mm (at the expense of the complexity of reading
/proc/pid/mem) came from Dave Hansen.  This results in no
runtime overhead in the mm subsystem other than comparing
the anon_name pointers when considering vma merging.  The pointer
is stored in a union with fieds that are only used on file-backed
mappings, so it does not increase memory usage.

Includes fix from Jed Davis <jld@mozilla.com> for typo in
prctl_set_vma_anon_name, which could attempt to set the name
across two vmas at the same time due to a typo, which might
corrupt the vma list.  Fix it to use tmp instead of end to limit
the name setting to a single vma at a time.

Change-Id: I9aa7b6b5ef536cd780599ba4e2fba8ceebe8b59f
Signed-off-by: Dmitry Shmidt <dimitrysh@google.com>
2017-01-27 13:52:21 -08:00
Kirill A. Shutemov
bae473a423 mm: introduce fault_env
The idea borrowed from Peter's patch from patchset on speculative page
faults[1]:

Instead of passing around the endless list of function arguments,
replace the lot with a single structure so we can change context without
endless function signature changes.

The changes are mostly mechanical with exception of faultaround code:
filemap_map_pages() got reworked a bit.

This patch is preparation for the next one.

[1] http://lkml.kernel.org/r/20141020222841.302891540@infradead.org

Link: http://lkml.kernel.org/r/1466021202-61880-9-git-send-email-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26 16:19:19 -07:00
Oleg Nesterov
d2005e3f41 userfaultfd: don't pin the user memory in userfaultfd_file_create()
userfaultfd_file_create() increments mm->mm_users; this means that the
memory won't be unmapped/freed if mm owner exits/execs, and UFFDIO_COPY
after that can populate the orphaned mm more.

Change userfaultfd_file_create() and userfaultfd_ctx_put() to use
mm->mm_count to pin mm_struct.  This means that
atomic_inc_not_zero(mm->mm_users) is needed when we are going to
actually play with this memory.  Except handle_userfault() path doesn't
need this, the caller must already have a reference.

The patch adds the new trivial helper, mmget_not_zero(), it can have
more users.

Link: http://lkml.kernel.org/r/20160516172254.GA8595@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-20 17:58:30 -07:00
Linus Torvalds
39680f50ae userfaultfd: don't block on the last VM updates at exit time
The exit path will do some final updates to the VM of an exiting process
to inform others of the fact that the process is going away.

That happens, for example, for robust futex state cleanup, but also if
the parent has asked for a TID update when the process exits (we clear
the child tid field in user space).

However, at the time we do those final VM accesses, we've already
stopped accepting signals, so the usual "stop waiting for userfaults on
signal" code in fs/userfaultfd.c no longer works, and the process can
become an unkillable zombie waiting for something that will never
happen.

To solve this, just make handle_userfault() abort any user fault
handling if we're already in the exit path past the signal handling
state being dead (marked by PF_EXITING).

This VM special case is pretty ugly, and it is possible that we should
look at finalizing signals later (or move the VM final accesses
earlier).  But in the meantime this is a fairly minimally intrusive fix.

Reported-and-tested-by: Dmitry Vyukov <dvyukov@google.com>
Acked-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-02 09:03:18 -08:00
Andrea Arcangeli
ac5be6b47e userfaultfd: revert "userfaultfd: waitqueue: add nr wake parameter to __wake_up_locked_key"
This reverts commit 51360155ec and adapts
fs/userfaultfd.c to use the old version of that function.

It didn't look robust to call __wake_up_common with "nr == 1" when we
absolutely require wakeall semantics, but we've full control of what we
insert in the two waitqueue heads of the blocked userfaults.  No
exclusive waitqueue risks to be inserted into those two waitqueue heads
so we can as well stick to "nr == 1" of the old code and we can rely
purely on the fact no waitqueue inserted in one of the two waitqueue
heads we must enforce as wakeall, has wait->flags WQ_FLAG_EXCLUSIVE set.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dr. David Alan Gilbert <dgilbert@redhat.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Shuah Khan <shuahkh@osg.samsung.com>
Cc: Thierry Reding <treding@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-22 15:09:53 -07:00
Eric Biggers
c03e946fdd userfaultfd: add missing mmput() in error path
This fixes a memleak if anon_inode_getfile() fails in userfaultfd().

Signed-off-by: Eric Biggers <ebiggers3@gmail.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-17 21:16:07 -07:00
Andrea Arcangeli
2c5b7e1be7 userfaultfd: avoid missing wakeups during refile in userfaultfd_read
During the refile in userfaultfd_read both waitqueues could look empty to
the lockless wake_userfault().  Use a seqcount to prevent this false
negative that could leave an userfault blocked.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-04 16:54:41 -07:00
Andrea Arcangeli
dfa37dc3fc userfaultfd: allow signals to interrupt a userfault
This is only simple to achieve if the userfault is going to return to
userland (not to the kernel) because we can avoid returning VM_FAULT_RETRY
despite we temporarily released the mmap_sem.  The fault would just be
retried by userland then.  This is safe at least on x86 and powerpc (the
two archs with the syscall implemented so far).

Hint to verify for which archs this is safe: after handle_mm_fault
returns, no access to data structures protected by the mmap_sem must be
done by the fault code in arch/*/mm/fault.c until up_read(&mm->mmap_sem)
is called.

This has two main benefits: signals can run with lower latency in
production (signals aren't blocked by userfaults and userfaults are
immediately repeated after signal processing) and gdb can then trivially
debug the threads blocked in this kind of userfaults coming directly from
userland.

On a side note: while gdb has a need to get signal processed, coredumps
always worked perfectly with userfaults, no matter if the userfault is
triggered by GUP a kernel copy_user or directly from userland.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-04 16:54:41 -07:00
Andrea Arcangeli
e6485a47b7 userfaultfd: require UFFDIO_API before other ioctls
UFFDIO_API was already forced before read/poll could work.  This makes the
code more strict to force it also for all other ioctls.

All users would already have been required to call UFFDIO_API before
invoking other ioctls but this makes it more explicit.

This will ensure we can change all ioctls (all but UFFDIO_API/struct
uffdio_api) with a bump of uffdio_api.api.

There's no actual plan or need to change the API or the ioctl, the current
API already should cover fine even the non cooperative usage, but this is
just for the longer term future just in case.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-04 16:54:41 -07:00
Andrea Arcangeli
ad465cae96 userfaultfd: UFFDIO_COPY and UFFDIO_ZEROPAGE
These two ioctl allows to either atomically copy or to map zeropages
into the virtual address space. This is used by the thread that opened
the userfaultfd to resolve the userfaults.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com>
Cc: zhang.zhanghailiang@huawei.com
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Andres Lagar-Cavilla <andreslc@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Hugh Dickins <hughd@google.com>
Cc: Peter Feiner <pfeiner@google.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-04 16:54:41 -07:00
Andrea Arcangeli
8d2afd96c2 userfaultfd: solve the race between UFFDIO_COPY|ZEROPAGE and read
Solve in-kernel the race between UFFDIO_COPY|ZEROPAGE and
userfaultfd_read if they are run on different threads simultaneously.

Until now qemu solved the race in userland: the race was explicitly
and intentionally left for userland to solve. However we can also
solve it in kernel.

Requiring all users to solve this race if they use two threads (one
for the background transfer and one for the userfault reads) isn't
very attractive from an API prospective, furthermore this allows to
remove a whole bunch of mutex and bitmap code from qemu, making it
faster. The cost of __get_user_pages_fast should be insignificant
considering it scales perfectly and the pagetables are already hot in
the CPU cache, compared to the overhead in userland to maintain those
structures.

Applying this patch is backwards compatible with respect to the
userfaultfd userland API, however reverting this change wouldn't be
backwards compatible anymore.

Without this patch qemu in the background transfer thread, has to read
the old state, and do UFFDIO_WAKE if old_state is missing but it
become REQUESTED by the time it tries to set it to RECEIVED (signaling
the other side received an userfault).

    vcpu                background_thr userfault_thr
    -----               -----          -----
    vcpu0 handle_mm_fault()

                        postcopy_place_page
                        read old_state -> MISSING
                        UFFDIO_COPY 0x7fb76a139000 (no wakeup, still pending)

    vcpu0 fault at 0x7fb76a139000 enters handle_userfault
    poll() is kicked

                                        poll() -> POLLIN
                                        read() -> 0x7fb76a139000
                                        postcopy_pmi_change_state(MISSING, REQUESTED) -> REQUESTED

                        tmp_state = postcopy_pmi_change_state(old_state, RECEIVED) -> REQUESTED
                        /* check that no userfault raced with UFFDIO_COPY */
                        if (old_state == MISSING && tmp_state == REQUESTED)
                                UFFDIO_WAKE from background thread

And a second case where a UFFDIO_WAKE would be needed is in the userfault thread:

    vcpu                background_thr userfault_thr
    -----               -----          -----
    vcpu0 handle_mm_fault()

                        postcopy_place_page
                        read old_state -> MISSING
                        UFFDIO_COPY 0x7fb76a139000 (no wakeup, still pending)
                        tmp_state = postcopy_pmi_change_state(old_state, RECEIVED) -> RECEIVED

    vcpu0 fault at 0x7fb76a139000 enters handle_userfault
    poll() is kicked

                                        poll() -> POLLIN
                                        read() -> 0x7fb76a139000

                                        if (postcopy_pmi_change_state(MISSING, REQUESTED) == RECEIVED)
                                                UFFDIO_WAKE from userfault thread

This patch removes the need of both UFFDIO_WAKE and of the associated
per-page tristate as well.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com>
Cc: zhang.zhanghailiang@huawei.com
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Andres Lagar-Cavilla <andreslc@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Hugh Dickins <hughd@google.com>
Cc: Peter Feiner <pfeiner@google.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-04 16:54:41 -07:00
Andrea Arcangeli
3004ec9cab userfaultfd: allocate the userfaultfd_ctx cacheline aligned
Use proper slab to guarantee alignment.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com>
Cc: zhang.zhanghailiang@huawei.com
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Andres Lagar-Cavilla <andreslc@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Hugh Dickins <hughd@google.com>
Cc: Peter Feiner <pfeiner@google.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-04 16:54:41 -07:00
Andrea Arcangeli
15b726ef04 userfaultfd: optimize read() and poll() to be O(1)
This makes read O(1) and poll that was already O(1) becomes lockless.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com>
Cc: zhang.zhanghailiang@huawei.com
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Andres Lagar-Cavilla <andreslc@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Hugh Dickins <hughd@google.com>
Cc: Peter Feiner <pfeiner@google.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-04 16:54:41 -07:00
Andrea Arcangeli
ba85c702e4 userfaultfd: wake pending userfaults
This is an optimization but it's a userland visible one and it affects
the API.

The downside of this optimization is that if you call poll() and you
get POLLIN, read(ufd) may still return -EAGAIN. The blocked userfault
may be waken by a different thread, before read(ufd) comes
around. This in short means that poll() isn't really usable if the
userfaultfd is opened in blocking mode.

userfaults won't wait in "pending" state to be read anymore and any
UFFDIO_WAKE or similar operations that has the objective of waking
userfaults after their resolution, will wake all blocked userfaults
for the resolved range, including those that haven't been read() by
userland yet.

The behavior of poll() becomes not standard, but this obviates the
need of "spurious" UFFDIO_WAKE and it lets the userland threads to
restart immediately without requiring an UFFDIO_WAKE. This is even
more significant in case of repeated faults on the same address from
multiple threads.

This optimization is justified by the measurement that the number of
spurious UFFDIO_WAKE accounts for 5% and 10% of the total
userfaults for heavy workloads, so it's worth optimizing those away.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com>
Cc: zhang.zhanghailiang@huawei.com
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Andres Lagar-Cavilla <andreslc@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Hugh Dickins <hughd@google.com>
Cc: Peter Feiner <pfeiner@google.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-04 16:54:41 -07:00
Andrea Arcangeli
a9b85f9415 userfaultfd: change the read API to return a uffd_msg
I had requests to return the full address (not the page aligned one) to
userland.

It's not entirely clear how the page offset could be relevant because
userfaults aren't like SIGBUS that can sigjump to a different place and it
actually skip resolving the fault depending on a page offset.  There's
currently no real way to skip the fault especially because after a
UFFDIO_COPY|ZEROPAGE, the fault is optimized to be retried within the
kernel without having to return to userland first (not even self modifying
code replacing the .text that touched the faulting address would prevent
the fault to be repeated).  Userland cannot skip repeating the fault even
more so if the fault was triggered by a KVM secondary page fault or any
get_user_pages or any copy-user inside some syscall which will return to
kernel code.  The second time FAULT_FLAG_RETRY_NOWAIT won't be set leading
to a SIGBUS being raised because the userfault can't wait if it cannot
release the mmap_map first (and FAULT_FLAG_RETRY_NOWAIT is required for
that).

Still returning userland a proper structure during the read() on the uffd,
can allow to use the current UFFD_API for the future non-cooperative
extensions too and it looks cleaner as well.  Once we get additional
fields there's no point to return the fault address page aligned anymore
to reuse the bits below PAGE_SHIFT.

The only downside is that the read() syscall will read 32bytes instead of
8bytes but that's not going to be measurable overhead.

The total number of new events that can be extended or of new future bits
for already shipped events, is limited to 64 by the features field of the
uffdio_api structure.  If more will be needed a bump of UFFD_API will be
required.

[akpm@linux-foundation.org: use __packed]
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com>
Cc: zhang.zhanghailiang@huawei.com
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Andres Lagar-Cavilla <andreslc@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Hugh Dickins <hughd@google.com>
Cc: Peter Feiner <pfeiner@google.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-04 16:54:41 -07:00
Pavel Emelyanov
3f602d2724 userfaultfd: Rename uffd_api.bits into .features
This is (seems to be) the minimal thing that is required to unblock
standard uffd usage from the non-cooperative one.  Now more bits can be
added to the features field indicating e.g.  UFFD_FEATURE_FORK and others
needed for the latter use-case.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com>
Cc: zhang.zhanghailiang@huawei.com
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Andres Lagar-Cavilla <andreslc@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Hugh Dickins <hughd@google.com>
Cc: Peter Feiner <pfeiner@google.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-04 16:54:41 -07:00
Andrea Arcangeli
86039bd3b4 userfaultfd: add new syscall to provide memory externalization
Once an userfaultfd has been created and certain region of the process
virtual address space have been registered into it, the thread responsible
for doing the memory externalization can manage the page faults in
userland by talking to the kernel using the userfaultfd protocol.

poll() can be used to know when there are new pending userfaults to be
read (POLLIN).

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com>
Cc: zhang.zhanghailiang@huawei.com
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Andres Lagar-Cavilla <andreslc@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Hugh Dickins <hughd@google.com>
Cc: Peter Feiner <pfeiner@google.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-04 16:54:41 -07:00