Commit Graph

484 Commits

Author SHA1 Message Date
Marcelo Henrique Cerri
dc787bb53d proc: Avoid mixing integer types in mem_rw()
[ Upstream commit d238692b4b ]

Use size_t when capping the count argument received by mem_rw(). Since
count is size_t, using min_t(int, ...) can lead to a negative value
that will later be passed to access_remote_vm(), which can cause
unexpected behavior.

Since we are capping the value to at maximum PAGE_SIZE, the conversion
from size_t to int when passing it to access_remote_vm() as "len"
shouldn't be a problem.

Link: https://lkml.kernel.org/r/20210512125215.3348316-1-marcelo.cerri@canonical.com
Reviewed-by: David Disseldorp <ddiss@suse.de>
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com>
Signed-off-by: Marcelo Henrique Cerri <marcelo.cerri@canonical.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Souza Cascardo <cascardo@canonical.com>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-05-16 11:34:11 +09:00
Linus Torvalds
83cac9ce25 proc: only require mm_struct for writing
commit 94f0b2d4a1 upstream.

Commit 591a22c14d ("proc: Track /proc/$pid/attr/ opener mm_struct") we
started using __mem_open() to track the mm_struct at open-time, so that
we could then check it for writes.

But that also ended up making the permission checks at open time much
stricter - and not just for writes, but for reads too.  And that in turn
caused a regression for at least Fedora 29, where NIC interfaces fail to
start when using NetworkManager.

Since only the write side wanted the mm_struct test, ignore any failures
by __mem_open() at open time, leaving reads unaffected.  The write()
time verification of the mm_struct pointer will then catch the failure
case because a NULL pointer will not match a valid 'current->mm'.

Link: https://lore.kernel.org/netdev/YMjTlp2FSJYvoyFa@unreal/
Fixes: 591a22c14d ("proc: Track /proc/$pid/attr/ opener mm_struct")
Reported-and-tested-by: Leon Romanovsky <leon@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Andrea Righi <andrea.righi@canonical.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-05-16 11:14:30 +09:00
Kees Cook
16dbfc32a9 proc: Track /proc/$pid/attr/ opener mm_struct
commit 591a22c14d upstream.

Commit bfb819ea20 ("proc: Check /proc/$pid/attr/ writes against file opener")
tried to make sure that there could not be a confusion between the opener of
a /proc/$pid/attr/ file and the writer. It used struct cred to make sure
the privileges didn't change. However, there were existing cases where a more
privileged thread was passing the opened fd to a differently privileged thread
(during container setup). Instead, use mm_struct to track whether the opener
and writer are still the same process. (This is what several other proc files
already do, though for different reasons.)

Reported-by: Christian Brauner <christian.brauner@ubuntu.com>
Reported-by: Andrea Righi <andrea.righi@canonical.com>
Tested-by: Andrea Righi <andrea.righi@canonical.com>
Fixes: bfb819ea20 ("proc: Check /proc/$pid/attr/ writes against file opener")
Cc: stable@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-05-16 11:12:54 +09:00
Kees Cook
5e910e1d03 proc: Check /proc/$pid/attr/ writes against file opener
commit bfb819ea20 upstream.

Fix another "confused deputy" weakness[1]. Writes to /proc/$pid/attr/
files need to check the opener credentials, since these fds do not
transition state across execve(). Without this, it is possible to
trick another process (which may have different credentials) to write
to its own /proc/$pid/attr/ files, leading to unexpected and possibly
exploitable behaviors.

[1] https://www.kernel.org/doc/html/latest/security/credentials.html?highlight=confused#open-file-credentials

Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Cc: stable@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-05-16 11:09:39 +09:00
Michal Hocko
e1b2af0221 proc, oom: do not report alien mms when setting oom_score_adj
commit b2b469939e upstream.

Tetsuo has reported that creating a thousands of processes sharing MM
without SIGHAND (aka alien threads) and setting
/proc/<pid>/oom_score_adj will swamp the kernel log and takes ages [1]
to finish.  This is especially worrisome that all that printing is done
under RCU lock and this can potentially trigger RCU stall or softlockup
detector.

The primary reason for the printk was to catch potential users who might
depend on the behavior prior to 44a70adec9 ("mm, oom_adj: make sure
processes sharing mm have same view of oom_score_adj") but after more
than 2 years without a single report I guess it is safe to simply remove
the printk altogether.

The next step should be moving oom_score_adj over to the mm struct and
remove all the tasks crawling as suggested by [2]

[1] http://lkml.kernel.org/r/97fce864-6f75-bca5-14bc-12c9f890e740@i-love.sakura.ne.jp
[2] http://lkml.kernel.org/r/20190117155159.GA4087@dhcp22.suse.cz

Link: http://lkml.kernel.org/r/20190212102129.26288-1-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reported-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Yong-Taek Lee <ytk.lee@samsung.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-05-15 11:44:06 +09:00
Jann Horn
20f1b93fac proc: restrict kernel stack dumps to root
commit f8a00cef17 upstream.

Currently, you can use /proc/self/task/*/stack to cause a stack walk on
a task you control while it is running on another CPU.  That means that
the stack can change under the stack walker.  The stack walker does
have guards against going completely off the rails and into random
kernel memory, but it can interpret random data from your kernel stack
as instruction pointers and stack pointers.  This can cause exposure of
kernel stack contents to userspace.

Restrict the ability to inspect kernel stacks of arbitrary tasks to root
in order to prevent a local attacker from exploiting racy stack unwinding
to leak kernel task stack contents.  See the added comment for a longer
rationale.

There don't seem to be any users of this userspace API that can't
gracefully bail out if reading from the file fails.  Therefore, I believe
that this change is unlikely to break things.  In the case that this patch
does end up needing a revert, the next-best solution might be to fake a
single-entry stack based on wchan.

Link: http://lkml.kernel.org/r/20180927153316.200286-1-jannh@google.com
Fixes: 2ec220e27f ("proc: add /proc/*/stack")
Signed-off-by: Jann Horn <jannh@google.com>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Ken Chen <kenchen@google.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Laura Abbott <labbott@redhat.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-05-15 08:34:25 +09:00
Greg Kroah-Hartman
9797dcb8c7 Merge 4.9.104 into android-4.9
Changes in 4.9.104
	MIPS: c-r4k: Fix data corruption related to cache coherence
	MIPS: ptrace: Expose FIR register through FP regset
	MIPS: Fix ptrace(2) PTRACE_PEEKUSR and PTRACE_POKEUSR accesses to o32 FGRs
	KVM: Fix spelling mistake: "cop_unsuable" -> "cop_unusable"
	affs_lookup(): close a race with affs_remove_link()
	aio: fix io_destroy(2) vs. lookup_ioctx() race
	ALSA: timer: Fix pause event notification
	do d_instantiate/unlock_new_inode combinations safely
	mmc: sdhci-iproc: remove hard coded mmc cap 1.8v
	mmc: sdhci-iproc: fix 32bit writes for TRANSFER_MODE register
	libata: Blacklist some Sandisk SSDs for NCQ
	libata: blacklist Micron 500IT SSD with MU01 firmware
	xen-swiotlb: fix the check condition for xen_swiotlb_free_coherent
	drm/vmwgfx: Fix 32-bit VMW_PORT_HB_[IN|OUT] macros
	IB/hfi1: Use after free race condition in send context error path
	Revert "ipc/shm: Fix shmat mmap nil-page protection"
	ipc/shm: fix shmat() nil address after round-down when remapping
	kasan: fix memory hotplug during boot
	kernel/sys.c: fix potential Spectre v1 issue
	kernel/signal.c: avoid undefined behaviour in kill_something_info
	KVM/VMX: Expose SSBD properly to guests
	KVM: s390: vsie: fix < 8k check for the itdba
	KVM: x86: Update cpuid properly when CR4.OSXAVE or CR4.PKE is changed
	kvm: x86: IA32_ARCH_CAPABILITIES is always supported
	firewire-ohci: work around oversized DMA reads on JMicron controllers
	x86/tsc: Allow TSC calibration without PIT
	NFSv4: always set NFS_LOCK_LOST when a lock is lost.
	ALSA: hda - Use IS_REACHABLE() for dependency on input
	kvm: x86: fix KVM_XEN_HVM_CONFIG ioctl
	netfilter: ipv6: nf_defrag: Pass on packets to stack per RFC2460
	tracing/hrtimer: Fix tracing bugs by taking all clock bases and modes into account
	PCI: Add function 1 DMA alias quirk for Marvell 9128
	Input: psmouse - fix Synaptics detection when protocol is disabled
	i40iw: Zero-out consumer key on allocate stag for FMR
	tools lib traceevent: Simplify pointer print logic and fix %pF
	perf callchain: Fix attr.sample_max_stack setting
	tools lib traceevent: Fix get_field_str() for dynamic strings
	perf record: Fix failed memory allocation for get_cpuid_str
	iommu/vt-d: Use domain instead of cache fetching
	dm thin: fix documentation relative to low water mark threshold
	net: stmmac: dwmac-meson8b: fix setting the RGMII TX clock on Meson8b
	net: stmmac: dwmac-meson8b: propagate rate changes to the parent clock
	nfs: Do not convert nfs_idmap_cache_timeout to jiffies
	watchdog: sp5100_tco: Fix watchdog disable bit
	kconfig: Don't leak main menus during parsing
	kconfig: Fix automatic menu creation mem leak
	kconfig: Fix expr_free() E_NOT leak
	mac80211_hwsim: fix possible memory leak in hwsim_new_radio_nl()
	ipmi/powernv: Fix error return code in ipmi_powernv_probe()
	Btrfs: set plug for fsync
	btrfs: Fix out of bounds access in btrfs_search_slot
	Btrfs: fix scrub to repair raid6 corruption
	btrfs: fail mount when sb flag is not in BTRFS_SUPER_FLAG_SUPP
	HID: roccat: prevent an out of bounds read in kovaplus_profile_activated()
	fm10k: fix "failed to kill vid" message for VF
	device property: Define type of PROPERTY_ENRTY_*() macros
	jffs2: Fix use-after-free bug in jffs2_iget()'s error handling path
	powerpc/numa: Use ibm,max-associativity-domains to discover possible nodes
	powerpc/numa: Ensure nodes initialized for hotplug
	RDMA/mlx5: Avoid memory leak in case of XRCD dealloc failure
	ntb_transport: Fix bug with max_mw_size parameter
	gianfar: prevent integer wrapping in the rx handler
	tcp_nv: fix potential integer overflow in tcpnv_acked
	kvm: Map PFN-type memory regions as writable (if possible)
	ocfs2: return -EROFS to mount.ocfs2 if inode block is invalid
	ocfs2/acl: use 'ip_xattr_sem' to protect getting extended attribute
	ocfs2: return error when we attempt to access a dirty bh in jbd2
	mm/mempolicy: fix the check of nodemask from user
	mm/mempolicy: add nodes_empty check in SYSC_migrate_pages
	asm-generic: provide generic_pmdp_establish()
	sparc64: update pmdp_invalidate() to return old pmd value
	mm: thp: use down_read_trylock() in khugepaged to avoid long block
	mm: pin address_space before dereferencing it while isolating an LRU page
	mm/fadvise: discard partial page if endbyte is also EOF
	openvswitch: Remove padding from packet before L3+ conntrack processing
	IB/ipoib: Fix for potential no-carrier state
	drm/nouveau/pmu/fuc: don't use movw directly anymore
	netfilter: ipv6: nf_defrag: Kill frag queue on RFC2460 failure
	x86/power: Fix swsusp_arch_resume prototype
	firmware: dmi_scan: Fix handling of empty DMI strings
	ACPI: processor_perflib: Do not send _PPC change notification if not ready
	ACPI / scan: Use acpi_bus_get_status() to initialize ACPI_TYPE_DEVICE devs
	bpf: fix selftests/bpf test_kmod.sh failure when CONFIG_BPF_JIT_ALWAYS_ON=y
	MIPS: generic: Fix machine compatible matching
	MIPS: TXx9: use IS_BUILTIN() for CONFIG_LEDS_CLASS
	xen-netfront: Fix race between device setup and open
	xen/grant-table: Use put_page instead of free_page
	RDS: IB: Fix null pointer issue
	arm64: spinlock: Fix theoretical trylock() A-B-A with LSE atomics
	proc: fix /proc/*/map_files lookup
	cifs: silence compiler warnings showing up with gcc-8.0.0
	bcache: properly set task state in bch_writeback_thread()
	bcache: fix for allocator and register thread race
	bcache: fix for data collapse after re-attaching an attached device
	bcache: return attach error when no cache set exist
	tools/libbpf: handle issues with bpf ELF objects containing .eh_frames
	bpf: fix rlimit in reuseport net selftest
	vfs/proc/kcore, x86/mm/kcore: Fix SMAP fault when dumping vsyscall user page
	locking/qspinlock: Ensure node->count is updated before initialising node
	irqchip/gic-v3: Ignore disabled ITS nodes
	cpumask: Make for_each_cpu_wrap() available on UP as well
	irqchip/gic-v3: Change pr_debug message to pr_devel
	ARC: Fix malformed ARC_EMUL_UNALIGNED default
	ptr_ring: prevent integer overflow when calculating size
	libata: Fix compile warning with ATA_DEBUG enabled
	selftests: pstore: Adding config fragment CONFIG_PSTORE_RAM=m
	selftests: memfd: add config fragment for fuse
	ARM: OMAP2+: timer: fix a kmemleak caused in omap_get_timer_dt
	ARM: OMAP3: Fix prm wake interrupt for resume
	ARM: OMAP1: clock: Fix debugfs_create_*() usage
	ibmvnic: Free RX socket buffer in case of adapter error
	iwlwifi: mvm: fix security bug in PN checking
	iwlwifi: mvm: always init rs with 20mhz bandwidth rates
	NFC: llcp: Limit size of SDP URI
	rxrpc: Work around usercopy check
	mac80211: round IEEE80211_TX_STATUS_HEADROOM up to multiple of 4
	mac80211: fix a possible leak of station stats
	mac80211: fix calling sleeping function in atomic context
	mac80211: Do not disconnect on invalid operating class
	md raid10: fix NULL deference in handle_write_completed()
	drm/exynos: g2d: use monotonic timestamps
	drm/exynos: fix comparison to bitshift when dealing with a mask
	locking/xchg/alpha: Add unconditional memory barrier to cmpxchg()
	md: raid5: avoid string overflow warning
	kernel/relay.c: limit kmalloc size to KMALLOC_MAX_SIZE
	powerpc/bpf/jit: Fix 32-bit JIT for seccomp_data access
	s390/cio: fix ccw_device_start_timeout API
	s390/cio: fix return code after missing interrupt
	s390/cio: clear timer when terminating driver I/O
	PKCS#7: fix direct verification of SignerInfo signature
	ARM: OMAP: Fix dmtimer init for omap1
	smsc75xx: fix smsc75xx_set_features()
	regulatory: add NUL to request alpha2
	integrity/security: fix digsig.c build error with header file
	locking/xchg/alpha: Fix xchg() and cmpxchg() memory ordering bugs
	x86/topology: Update the 'cpu cores' field in /proc/cpuinfo correctly across CPU hotplug operations
	mac80211: drop frames with unexpected DS bits from fast-rx to slow path
	arm64: fix unwind_frame() for filtered out fn for function graph tracing
	macvlan: fix use-after-free in macvlan_common_newlink()
	kvm: fix warning for CONFIG_HAVE_KVM_EVENTFD builds
	fs: dcache: Avoid livelock between d_alloc_parallel and __d_add
	fs: dcache: Use READ_ONCE when accessing i_dir_seq
	md: fix a potential deadlock of raid5/raid10 reshape
	md/raid1: fix NULL pointer dereference
	batman-adv: fix packet checksum in receive path
	batman-adv: invalidate checksum on fragment reassembly
	netfilter: ebtables: convert BUG_ONs to WARN_ONs
	batman-adv: Ignore invalid batadv_iv_gw during netlink send
	batman-adv: Ignore invalid batadv_v_gw during netlink send
	batman-adv: Fix netlink dumping of BLA claims
	batman-adv: Fix netlink dumping of BLA backbones
	nvme-pci: Fix nvme queue cleanup if IRQ setup fails
	clocksource/drivers/fsl_ftm_timer: Fix error return checking
	ceph: fix dentry leak when failing to init debugfs
	ARM: orion5x: Revert commit 4904dbda41.
	qrtr: add MODULE_ALIAS macro to smd
	r8152: fix tx packets accounting
	virtio-gpu: fix ioctl and expose the fixed status to userspace.
	dmaengine: rcar-dmac: fix max_chunk_size for R-Car Gen3
	bcache: fix kcrashes with fio in RAID5 backend dev
	ip6_tunnel: fix IFLA_MTU ignored on NEWLINK
	sit: fix IFLA_MTU ignored on NEWLINK
	ARM: dts: NSP: Fix amount of RAM on BCM958625HR
	powerpc/boot: Fix random libfdt related build errors
	gianfar: Fix Rx byte accounting for ndev stats
	net/tcp/illinois: replace broken algorithm reference link
	nvmet: fix PSDT field check in command format
	xen/pirq: fix error path cleanup when binding MSIs
	drm/sun4i: Fix dclk_set_phase
	Btrfs: send, fix issuing write op when processing hole in no data mode
	selftests/powerpc: Skip the subpage_prot tests if the syscall is unavailable
	KVM: PPC: Book3S HV: Fix VRMA initialization with 2MB or 1GB memory backing
	iwlwifi: mvm: fix TX of CCMP 256
	watchdog: f71808e_wdt: Fix magic close handling
	watchdog: sbsa: use 32-bit read for WCV
	batman-adv: Fix multicast packet loss with a single WANT_ALL_IPV4/6 flag
	e1000e: Fix check_for_link return value with autoneg off
	e1000e: allocate ring descriptors with dma_zalloc_coherent
	ia64/err-inject: Use get_user_pages_fast()
	RDMA/qedr: Fix kernel panic when running fio over NFSoRDMA
	RDMA/qedr: Fix iWARP write and send with immediate
	IB/mlx4: Fix corruption of RoCEv2 IPv4 GIDs
	IB/mlx4: Include GID type when deleting GIDs from HW table under RoCE
	IB/mlx5: Fix an error code in __mlx5_ib_modify_qp()
	fbdev: Fixing arbitrary kernel leak in case FBIOGETCMAP_SPARC in sbusfb_ioctl_helper().
	fsl/fman: avoid sleeping in atomic context while adding an address
	net: qcom/emac: Use proper free methods during TX
	net: smsc911x: Fix unload crash when link is up
	IB/core: Fix possible crash to access NULL netdev
	xen: xenbus: use put_device() instead of kfree()
	arm64: Relax ARM_SMCCC_ARCH_WORKAROUND_1 discovery
	dmaengine: mv_xor_v2: Fix clock resource by adding a register clock
	netfilter: ebtables: fix erroneous reject of last rule
	bnxt_en: Check valid VNIC ID in bnxt_hwrm_vnic_set_tpa().
	workqueue: use put_device() instead of kfree()
	ipv4: lock mtu in fnhe when received PMTU < net.ipv4.route.min_pmtu
	sunvnet: does not support GSO for sctp
	drm/imx: move arming of the vblank event to atomic_flush
	net: Fix vlan untag for bridge and vlan_dev with reorder_hdr off
	batman-adv: fix header size check in batadv_dbg_arp()
	batman-adv: Fix skbuff rcsum on packet reroute
	vti4: Don't count header length twice on tunnel setup
	vti4: Don't override MTU passed on link creation via IFLA_MTU
	perf/cgroup: Fix child event counting bug
	brcmfmac: Fix check for ISO3166 code
	kbuild: make scripts/adjust_autoksyms.sh robust against timestamp races
	RDMA/ucma: Correct option size check using optlen
	RDMA/qedr: fix QP's ack timeout configuration
	RDMA/qedr: Fix rc initialization on CNQ allocation failure
	mm/mempolicy.c: avoid use uninitialized preferred_node
	mm, thp: do not cause memcg oom for thp
	selftests: ftrace: Add probe event argument syntax testcase
	selftests: ftrace: Add a testcase for string type with kprobe_event
	selftests: ftrace: Add a testcase for probepoint
	batman-adv: fix multicast-via-unicast transmission with AP isolation
	batman-adv: fix packet loss for broadcasted DHCP packets to a server
	ARM: 8748/1: mm: Define vdso_start, vdso_end as array
	net: qmi_wwan: add BroadMobi BM806U 2020:2033
	perf/x86/intel: Fix linear IP of PEBS real_ip on Haswell and later CPUs
	llc: properly handle dev_queue_xmit() return value
	builddeb: Fix header package regarding dtc source links
	mm/kmemleak.c: wait for scan completion before disabling free
	net: Fix untag for vlan packets without ethernet header
	net: mvneta: fix enable of all initialized RXQs
	sh: fix debug trap failure to process signals before return to user
	nvme: don't send keep-alives to the discovery controller
	x86/pgtable: Don't set huge PUD/PMD on non-leaf entries
	x86/mm: Do not forbid _PAGE_RW before init for __ro_after_init
	fs/proc/proc_sysctl.c: fix potential page fault while unregistering sysctl table
	swap: divide-by-zero when zero length swap file on ssd
	sr: get/drop reference to device in revalidate and check_events
	Force log to disk before reading the AGF during a fstrim
	cpufreq: CPPC: Initialize shared perf capabilities of CPUs
	dp83640: Ensure against premature access to PHY registers after reset
	mm/ksm: fix interaction with THP
	mm: fix races between address_space dereference and free in page_evicatable
	Btrfs: bail out on error during replay_dir_deletes
	Btrfs: fix NULL pointer dereference in log_dir_items
	btrfs: Fix possible softlock on single core machines
	ocfs2/dlm: don't handle migrate lockres if already in shutdown
	sched/rt: Fix rq->clock_update_flags < RQCF_ACT_SKIP warning
	KVM: VMX: raise internal error for exception during invalid protected mode state
	fscache: Fix hanging wait on page discarded by writeback
	sparc64: Make atomic_xchg() an inline function rather than a macro.
	net: bgmac: Fix endian access in bgmac_dma_tx_ring_free()
	btrfs: tests/qgroup: Fix wrong tree backref level
	Btrfs: fix copy_items() return value when logging an inode
	btrfs: fix lockdep splat in btrfs_alloc_subvolume_writers
	rxrpc: Fix Tx ring annotation after initial Tx failure
	rxrpc: Don't treat call aborts as conn aborts
	xen/acpi: off by one in read_acpi_id()
	drivers: macintosh: rack-meter: really fix bogus memsets
	ACPI: acpi_pad: Fix memory leak in power saving threads
	powerpc/mpic: Check if cpu_possible() in mpic_physmask()
	m68k: set dma and coherent masks for platform FEC ethernets
	parisc/pci: Switch LBA PCI bus from Hard Fail to Soft Fail mode
	hwmon: (nct6775) Fix writing pwmX_mode
	powerpc/perf: Prevent kernel address leak to userspace via BHRB buffer
	powerpc/perf: Fix kernel address leak via sampling registers
	tools/thermal: tmon: fix for segfault
	selftests: Print the test we're running to /dev/kmsg
	net/mlx5: Protect from command bit overflow
	ath10k: Fix kernel panic while using worker (ath10k_sta_rc_update_wk)
	cxgb4: Setup FW queues before registering netdev
	ima: Fallback to the builtin hash algorithm
	virtio-net: Fix operstate for virtio when no VIRTIO_NET_F_STATUS
	arm: dts: socfpga: fix GIC PPI warning
	cpufreq: cppc_cpufreq: Fix cppc_cpufreq_init() failure path
	zorro: Set up z->dev.dma_mask for the DMA API
	bcache: quit dc->writeback_thread when BCACHE_DEV_DETACHING is set
	ACPICA: Events: add a return on failure from acpi_hw_register_read
	ACPICA: acpi: acpica: fix acpi operand cache leak in nseval.c
	cxgb4: Fix queue free path of ULD drivers
	i2c: mv64xxx: Apply errata delay only in standard mode
	KVM: lapic: stop advertising DIRECTED_EOI when in-kernel IOAPIC is in use
	perf top: Fix top.call-graph config option reading
	perf stat: Fix core dump when flag T is used
	IB/core: Honor port_num while resolving GID for IB link layer
	regulator: gpio: Fix some error handling paths in 'gpio_regulator_probe()'
	spi: bcm-qspi: fIX some error handling paths
	MIPS: ath79: Fix AR724X_PLL_REG_PCIE_CONFIG offset
	PCI: Restore config space on runtime resume despite being unbound
	ipmi_ssif: Fix kernel panic at msg_done_handler
	powerpc: Add missing prototype for arch_irq_work_raise()
	f2fs: fix to check extent cache in f2fs_drop_extent_tree
	perf/core: Fix perf_output_read_group()
	drm/panel: simple: Fix the bus format for the Ontat panel
	hwmon: (pmbus/max8688) Accept negative page register values
	hwmon: (pmbus/adm1275) Accept negative page register values
	perf/x86/intel: Properly save/restore the PMU state in the NMI handler
	cdrom: do not call check_disk_change() inside cdrom_open()
	perf/x86/intel: Fix large period handling on Broadwell CPUs
	perf/x86/intel: Fix event update for auto-reload
	arm64: dts: qcom: Fix SPI5 config on MSM8996
	soc: qcom: wcnss_ctrl: Fix increment in NV upload
	gfs2: Fix fallocate chunk size
	x86/devicetree: Initialize device tree before using it
	x86/devicetree: Fix device IRQ settings in DT
	ALSA: vmaster: Propagate slave error
	dmaengine: pl330: fix a race condition in case of threaded irqs
	dmaengine: rcar-dmac: Check the done lists in rcar_dmac_chan_get_residue()
	enic: enable rq before updating rq descriptors
	hwrng: stm32 - add reset during probe
	dmaengine: qcom: bam_dma: get num-channels and num-ees from dt
	net: stmmac: ensure that the device has released ownership before reading data
	net: stmmac: ensure that the MSS desc is the last desc to set the own bit
	cpufreq: Reorder cpufreq_online() error code path
	PCI: Add function 1 DMA alias quirk for Marvell 88SE9220
	udf: Provide saner default for invalid uid / gid
	ARM: dts: bcm283x: Fix probing of bcm2835-i2s
	audit: return on memory error to avoid null pointer dereference
	rcu: Call touch_nmi_watchdog() while printing stall warnings
	pinctrl: sh-pfc: r8a7796: Fix MOD_SEL register pin assignment for SSI pins group
	MIPS: Octeon: Fix logging messages with spurious periods after newlines
	drm/rockchip: Respect page offset for PRIME mmap calls
	x86/apic: Set up through-local-APIC mode on the boot CPU if 'noapic' specified
	perf tests: Use arch__compare_symbol_names to compare symbols
	perf report: Fix memory corruption in --branch-history mode --branch-history
	selftests/net: fixes psock_fanout eBPF test case
	netlabel: If PF_INET6, check sk_buff ip header version
	regmap: Correct comparison in regmap_cached
	ARM: dts: imx7d: cl-som-imx7: fix pinctrl_enet
	ARM: dts: porter: Fix HDMI output routing
	regulator: of: Add a missing 'of_node_put()' in an error handling path of 'of_regulator_match()'
	pinctrl: msm: Use dynamic GPIO numbering
	kdb: make "mdr" command repeat
	Linux 4.9.104

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2018-05-30 13:19:56 +02:00
Alexey Dobriyan
e0a1a0173a proc: fix /proc/*/map_files lookup
[ Upstream commit ac7f1061c2 ]

Current code does:

	if (sscanf(dentry->d_name.name, "%lx-%lx", start, end) != 2)

However sscanf() is broken garbage.

It silently accepts whitespace between format specifiers
(did you know that?).

It silently accepts valid strings which result in integer overflow.

Do not use sscanf() for any even remotely reliable parsing code.

	OK
	# readlink '/proc/1/map_files/55a23af39000-55a23b05b000'
	/lib/systemd/systemd

	broken
	# readlink '/proc/1/map_files/               55a23af39000-55a23b05b000'
	/lib/systemd/systemd

	broken
	# readlink '/proc/1/map_files/55a23af39000-55a23b05b000    '
	/lib/systemd/systemd

	very broken
	# readlink '/proc/1/map_files/1000000000000000055a23af39000-55a23b05b000'
	/lib/systemd/systemd

Andrei said:

: This patch breaks criu.  It was a bug in criu.  And this bug is on a minor
: path, which works when memfd_create() isn't available.  It is a reason why
: I ask to not backport this patch to stable kernels.
:
: In CRIU this bug can be triggered, only if this patch will be backported
: to a kernel which version is lower than v3.16.

Link: http://lkml.kernel.org/r/20171120212706.GA14325@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: Andrei Vagin <avagin@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-05-30 07:50:25 +02:00
Greg Kroah-Hartman
aef17a58e8 Merge 4.9.101 into android-4.9
Changes in 4.9.101
	8139too: Use disable_irq_nosync() in rtl8139_poll_controller()
	bridge: check iface upper dev when setting master via ioctl
	dccp: fix tasklet usage
	ipv4: fix memory leaks in udp_sendmsg, ping_v4_sendmsg
	llc: better deal with too small mtu
	net: ethernet: sun: niu set correct packet size in skb
	net: ethernet: ti: cpsw: fix packet leaking in dual_mac mode
	net/mlx4_en: Verify coalescing parameters are in range
	net/mlx5: E-Switch, Include VF RDMA stats in vport statistics
	net_sched: fq: take care of throttled flows before reuse
	net: support compat 64-bit time in {s,g}etsockopt
	openvswitch: Don't swap table in nlattr_set() after OVS_ATTR_NESTED is found
	qmi_wwan: do not steal interfaces from class drivers
	r8169: fix powering up RTL8168h
	sctp: handle two v4 addrs comparison in sctp_inet6_cmp_addr
	sctp: remove sctp_chunk_put from fail_mark err path in sctp_ulpevent_make_rcvmsg
	sctp: use the old asoc when making the cookie-ack chunk in dupcook_d
	tcp_bbr: fix to zero idle_restart only upon S/ACKed data
	tg3: Fix vunmap() BUG_ON() triggered from tg3_free_consistent().
	bonding: do not allow rlb updates to invalid mac
	net/mlx5: Avoid cleaning flow steering table twice during error flow
	bonding: send learning packets for vlans on slave
	tcp: ignore Fast Open on repair mode
	sctp: fix the issue that the cookie-ack with auth can't get processed
	sctp: delay the authentication for the duplicated cookie-echo chunk
	serial: sccnxp: Fix error handling in sccnxp_probe()
	futex: Remove duplicated code and fix undefined behaviour
	xfrm: fix xfrm_do_migrate() with AEAD e.g(AES-GCM)
	lockd: lost rollback of set_grace_period() in lockd_down_net()
	Revert "ARM: dts: imx6qdl-wandboard: Fix audio channel swap"
	l2tp: revert "l2tp: fix missing print session offset info"
	nfp: TX time stamp packets before HW doorbell is rung
	proc: do not access cmdline nor environ from file-backed areas
	futex: futex_wake_op, fix sign_extend32 sign bits
	kernel/exit.c: avoid undefined behaviour when calling wait4()
	Linux 4.9.101

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2018-05-19 14:06:17 +02:00
Willy Tarreau
6f1abf8628 proc: do not access cmdline nor environ from file-backed areas
commit 7f7ccc2ccc upstream.

proc_pid_cmdline_read() and environ_read() directly access the target
process' VM to retrieve the command line and environment. If this
process remaps these areas onto a file via mmap(), the requesting
process may experience various issues such as extra delays if the
underlying device is slow to respond.

Let's simply refuse to access file-backed areas in these functions.
For this we add a new FOLL_ANON gup flag that is passed to all calls
to access_remote_vm(). The code already takes care of such failures
(including unmapped areas). Accesses via /proc/pid/mem were not
changed though.

This was assigned CVE-2018-1120.

Note for stable backports: the patch may apply to kernels prior to 4.11
but silently miss one location; it must be checked that no call to
access_remote_vm() keeps zero as the last argument.

Reported-by: Qualys Security Advisory <qsa@qualys.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Willy Tarreau <w@1wt.eu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-05-19 10:27:00 +02:00
Greg Hackmann
3a6594d7d1 Revert "ANDROID: proc: make oom adjustment files user read-only"
CTS no longer expects oom_{adj,score_adj} to be read-only.  See
https://android-review.googlesource.com/530687/ for additional context.

This reverts commit 2956c9be7e.

Bug: 63142211
Change-Id: I9ff2728531eb2b09b5ddb19da51eacf6e4316dd6
Signed-off-by: Greg Hackmann <ghackmann@google.com>
2018-04-13 22:22:23 +00:00
Greg Hackmann
4405a1f744 Revert "ANDROID: fixup! proc: make oom adjustment files user read-only"
CTS no longer expects oom_{adj,score_adj} to be read-only.  See
https://android-review.googlesource.com/530687/ for additional context.

This reverts commit f049c419ae.

Bug: 63142211
Change-Id: I83c42123921e8dd0395420125e69246c4d85cda6
Signed-off-by: Greg Hackmann <ghackmann@google.com>
2018-04-13 22:22:07 +00:00
Connor O'Brien
6e7b83d80b ANDROID: cpufreq: track per-task time in state
Add time in state data to task structs, and create
/proc/<pid>/time_in_state files to show how long each individual task
has run at each frequency.
Create a CONFIG_CPU_FREQ_TIMES option to enable/disable this tracking.

Signed-off-by: Connor O'Brien <connoro@google.com>
Bug: 72339335
Bug: 70951257
Test: Read /proc/<pid>/time_in_state
Change-Id: Ia6456754f4cb1e83b2bc35efa8fbe9f8696febc8
2018-04-03 11:15:30 -07:00
Greg Kroah-Hartman
0d21cf1656 Merge 4.9.33 into android-4.9
Changes in 4.9.33
	PCI/PM: Add needs_resume flag to avoid suspend complete optimization
	drm/i915: Prevent the system suspend complete optimization
	partitions/msdos: FreeBSD UFS2 file systems are not recognized
	netfilter: nf_conntrack_sip: fix wrong memory initialisation
	ibmvnic: Fix endian errors in error reporting output
	ibmvnic: Fix endian error when requesting device capabilities
	net: xilinx_emaclite: fix freezes due to unordered I/O
	net: xilinx_emaclite: fix receive buffer overflow
	tcp: tcp_probe: use spin_lock_bh()
	ipv6: Handle IPv4-mapped src to in6addr_any dst.
	ipv6: Inhibit IPv4-mapped src address on the wire.
	tipc: Fix tipc_sk_reinit race conditions
	gfs2: Use rhashtable walk interface in glock_hash_walk
	NET: Fix /proc/net/arp for AX.25
	ibmvnic: Call napi_disable instead of napi_enable in failure path
	ibmvnic: Initialize completion variables before starting work
	NET: mkiss: Fix panic
	net: hns: Fix the device being used for dma mapping during TX
	sierra_net: Skip validating irrelevant fields for IDLE LSIs
	sierra_net: Add support for IPv6 and Dual-Stack Link Sense Indications
	i2c: piix4: Request the SMBUS semaphore inside the mutex
	i2c: piix4: Fix request_region size
	powerpc/powernv: Properly set "host-ipi" on IPIs
	kernel/ucount.c: mark user_header with kmemleak_ignore()
	net: thunderx: Fix PHY autoneg for SGMII QLM mode
	ipv6: addrconf: fix generation of new temporary addresses
	vfio/spapr_tce: Set window when adding additional groups to container
	ipv6: Fix IPv6 packet loss in scenarios involving roaming + snooping switches
	ARM: defconfigs: make NF_CT_PROTO_SCTP and NF_CT_PROTO_UDPLITE built-in
	PM / runtime: Avoid false-positive warnings from might_sleep_if()
	jump label: pass kbuild_cflags when checking for asm goto support
	shmem: fix sleeping from atomic context
	kasan: respect /proc/sys/kernel/traceoff_on_warning
	log2: make order_base_2() behave correctly on const input value zero
	ethtool: do not vzalloc(0) on registers dump
	net: phy: Fix lack of reference count on PHY driver
	net: phy: Fix PHY module checks and NULL deref in phy_attach_direct()
	net: fix ndo_features_check/ndo_fix_features comment ordering
	fscache: Fix dead object requeue
	fscache: Clear outstanding writes when disabling a cookie
	FS-Cache: Initialise stores_lock in netfs cookie
	ipv6: fix flow labels when the traffic class is non-0
	drm/nouveau: prevent userspace from deleting client object
	drm/nouveau/fence/g84-: protect against concurrent access to semaphore buffers
	net/mlx4_core: Avoid command timeouts during VF driver device shutdown
	gianfar: synchronize DMA API usage by free_skb_rx_queue w/ gfar_new_page
	pinctrl: baytrail: Rectify debounce support (part 2)
	cec: fix wrong last_la determination
	drm: prevent double-(un)registration for connectors
	drm: Don't race connector registration
	pinctrl: berlin-bg4ct: fix the value for "sd1a" of pin SCRD0_CRD_PRES
	net: adaptec: starfire: add checks for dma mapping errors
	drm/i915: Check for NULL i915_vma in intel_unpin_fb_obj()
	net/mlx5: E-Switch, Err when retrieving steering name-space fails
	net/mlx5: Return EOPNOTSUPP when failing to get steering name-space
	parisc, parport_gsc: Fixes for printk continuation lines
	net: phy: micrel: add support for KSZ8795
	gtp: add genl family modules alias
	drm/nouveau: Intercept ACPI_VIDEO_NOTIFY_PROBE
	drm/nouveau: Rename acpi_work to hpd_work
	drm/nouveau: Handle fbcon suspend/resume in seperate worker
	drm/nouveau: Don't enabling polling twice on runtime resume
	drm/nouveau: Fix drm poll_helper handling
	drm/ast: Fixed system hanged if disable P2A
	ravb: unmap descriptors when freeing rings
	nfs: Fix "Don't increment lock sequence ID after NFS4ERR_MOVED"
	nvmet-rdma: Fix missing dma sync to nvme data structures
	r8152: avoid start_xmit to call napi_schedule during autosuspend
	r8152: check rx after napi is enabled
	r8152: re-schedule napi for tx
	r8152: fix rtl8152_post_reset function
	r8152: avoid start_xmit to schedule napi when napi is disabled
	net-next: ethernet: mediatek: change the compatible string
	bnxt_en: Fix bnxt_reset() in the slow path task.
	bnxt_en: Enhance autoneg support.
	bnxt_en: Fix RTNL lock usage on bnxt_update_link().
	bnxt_en: Fix RTNL lock usage on bnxt_get_port_module_status().
	sctp: sctp gso should set feature with NETIF_F_SG when calling skb_segment
	sctp: sctp_addr_id2transport should verify the addr before looking up assoc
	usb: musb: Fix external abort on non-linefetch for musb_irq_work()
	mn10300: fix build error of missing fpu_save()
	romfs: use different way to generate fsid for BLOCK or MTD
	frv: add atomic64_add_unless()
	frv: add missing atomic64 operations
	proc: add a schedule point in proc_pid_readdir()
	userfaultfd: fix SIGBUS resulting from false rwsem wakeups
	kernel/watchdog.c: move hardlockup detector to separate file
	kernel/watchdog.c: move shared definitions to nmi.h
	kernel/watchdog: prevent false hardlockup on overloaded system
	vhost/vsock: handle vhost_vq_init_access() error
	ARC: smp-boot: Decouple Non masters waiting API from jump to entry point
	ARCv2: smp-boot: wake_flag polling by non-Masters needs to be uncached
	tipc: ignore requests when the connection state is not CONNECTED
	tipc: fix connection refcount error
	tipc: add subscription refcount to avoid invalid delete
	tipc: fix nametbl_lock soft lockup at node/link events
	netfilter: nf_tables: fix set->nelems counting with no NLM_F_EXCL
	netfilter: nft_log: restrict the log prefix length to 127
	RDMA/qedr: Dispatch port active event from qedr_add
	RDMA/qedr: Fix and simplify memory leak in PD alloc
	RDMA/qedr: Don't reset QP when queues aren't flushed
	RDMA/qedr: Don't spam dmesg if QP is in error state
	RDMA/qedr: Return max inline data in QP query result
	xtensa: don't use linux IRQ #0
	s390/kvm: do not rely on the ILC on kvm host protection fauls
	drm/i915: Workaround VLV/CHV DSI scanline counter hardware fail
	drm/i915: Always recompute watermarks when distrust_bios_wm is set, v2.
	sparc64: make string buffers large enough
	Linux 4.9.33

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2017-06-27 10:43:04 +02:00
Eric Dumazet
9618fba264 proc: add a schedule point in proc_pid_readdir()
[ Upstream commit 3ba4bceef2 ]

We have seen proc_pid_readdir() invocations holding cpu for more than 50
ms.  Add a cond_resched() to be gentle with other tasks.

[akpm@linux-foundation.org: coding style fix]
Link: http://lkml.kernel.org/r/1484238380.15816.42.camel@edumazet-glaptop3.roam.corp.google.com
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-17 06:41:56 +02:00
Dan Willemsen
f049c419ae ANDROID: fixup! proc: make oom adjustment files user read-only
Fix the build by removing the duplicate line that uses the obsolete INF
macro.

Signed-off-by: Dan Willemsen <dwillemsen@nvidia.com>
2017-01-27 13:51:58 -08:00
Rom Lemarchand
2956c9be7e ANDROID: proc: make oom adjustment files user read-only
Make oom_adj and oom_score_adj user read-only.

Bug: 19636629
Change-Id: I055bb172d5b4d3d856e25918f3c5de8edf31e4a3
Signed-off-by: Rom Lemarchand <romlem@google.com>
2017-01-27 13:51:57 -08:00
Leon Yu
06b2849d10 proc: fix NULL dereference when reading /proc/<pid>/auxv
Reading auxv of any kernel thread results in NULL pointer dereferencing
in auxv_read() where mm can be NULL.  Fix that by checking for NULL mm
and bailing out early.  This is also the original behavior changed by
recent commit c531716785 ("proc: switch auxv to use of __mem_open()").

  # cat /proc/2/auxv
  Unable to handle kernel NULL pointer dereference at virtual address 000000a8
  Internal error: Oops: 17 [#1] PREEMPT SMP ARM
  CPU: 3 PID: 113 Comm: cat Not tainted 4.9.0-rc1-ARCH+ #1
  Hardware name: BCM2709
  task: ea3b0b00 task.stack: e99b2000
  PC is at auxv_read+0x24/0x4c
  LR is at do_readv_writev+0x2fc/0x37c
  Process cat (pid: 113, stack limit = 0xe99b2210)
  Call chain:
    auxv_read
    do_readv_writev
    vfs_readv
    default_file_splice_read
    splice_direct_to_actor
    do_splice_direct
    do_sendfile
    SyS_sendfile64
    ret_fast_syscall

Fixes: c531716785 ("proc: switch auxv to use of __mem_open()")
Link: http://lkml.kernel.org/r/1476966200-14457-1-git-send-email-chianglungyu@gmail.com
Signed-off-by: Leon Yu <chianglungyu@gmail.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Kees Cook <keescook@chromium.org>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Mateusz Guzik <mguzik@redhat.com>
Cc: Janis Danisevskis <jdanis@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-27 18:43:43 -07:00
Linus Torvalds
272ddc8b37 proc: don't use FOLL_FORCE for reading cmdline and environment
Now that Lorenzo cleaned things up and made the FOLL_FORCE users
explicit, it becomes obvious how some of them don't really need
FOLL_FORCE at all.

So remove FOLL_FORCE from the proc code that reads the command line and
arguments from user space.

The mem_rw() function actually does want FOLL_FORCE, because gdd (and
possibly many other debuggers) use it as a much more convenient version
of PTRACE_PEEKDATA, but we should consider making the FOLL_FORCE part
conditional on actually being a ptracer.  This does not actually do
that, just moves adds a comment to that effect and moves the gup_flags
settings next to each other.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-24 19:00:44 -07:00
Lorenzo Stoakes
6347e8d5bc mm: replace access_remote_vm() write parameter with gup_flags
This removes the 'write' argument from access_remote_vm() and replaces
it with 'gup_flags' as use of this function previously silently implied
FOLL_FORCE, whereas after this patch callers explicitly pass this flag.

We make this explicit as use of FOLL_FORCE can result in surprising
behaviour (and hence bugs) within the mm subsystem.

Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-19 08:12:14 -07:00
Linus Torvalds
101105b171 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull more vfs updates from Al Viro:
 ">rename2() work from Miklos + current_time() from Deepa"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  fs: Replace current_fs_time() with current_time()
  fs: Replace CURRENT_TIME_SEC with current_time() for inode timestamps
  fs: Replace CURRENT_TIME with current_time() for inode timestamps
  fs: proc: Delete inode time initializations in proc_alloc_inode()
  vfs: Add current_time() api
  vfs: add note about i_op->rename changes to porting
  fs: rename "rename2" i_op to "rename"
  vfs: remove unused i_op->rename
  fs: make remaining filesystems use .rename2
  libfs: support RENAME_NOREPLACE in simple_rename()
  fs: support RENAME_NOREPLACE for local filesystems
  ncpfs: fix unused variable warning
2016-10-10 20:16:43 -07:00
Linus Torvalds
abb5a14fa2 Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull misc vfs updates from Al Viro:
 "Assorted misc bits and pieces.

  There are several single-topic branches left after this (rename2
  series from Miklos, current_time series from Deepa Dinamani, xattr
  series from Andreas, uaccess stuff from from me) and I'd prefer to
  send those separately"

* 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (39 commits)
  proc: switch auxv to use of __mem_open()
  hpfs: support FIEMAP
  cifs: get rid of unused arguments of CIFSSMBWrite()
  posix_acl: uapi header split
  posix_acl: xattr representation cleanups
  fs/aio.c: eliminate redundant loads in put_aio_ring_file
  fs/internal.h: add const to ns_dentry_operations declaration
  compat: remove compat_printk()
  fs/buffer.c: make __getblk_slow() static
  proc: unsigned file descriptors
  fs/file: more unsigned file descriptors
  fs: compat: remove redundant check of nr_segs
  cachefiles: Fix attempt to read i_blocks after deleting file [ver #2]
  cifs: don't use memcpy() to copy struct iov_iter
  get rid of separate multipage fault-in primitives
  fs: Avoid premature clearing of capabilities
  fs: Give dentry to inode_change_ok() instead of inode
  fuse: Propagate dentry down to inode_change_ok()
  ceph: Propagate dentry down to inode_change_ok()
  xfs: Propagate dentry down to inode_change_ok()
  ...
2016-10-10 13:04:49 -07:00
Al Viro
e55f1d1d13 Merge remote-tracking branch 'jk/vfs' into work.misc 2016-10-08 11:06:08 -04:00
John Stultz
4b2bd5fec0 proc: fix timerslack_ns CAP_SYS_NICE check when adjusting self
In changing from checking ptrace_may_access(p, PTRACE_MODE_ATTACH_FSCREDS)
to capable(CAP_SYS_NICE), I missed that ptrace_my_access succeeds when p
== current, but the CAP_SYS_NICE doesn't.

Thus while the previous commit was intended to loosen the needed
privileges to modify a processes timerslack, it needlessly restricted a
task modifying its own timerslack via the proc/<tid>/timerslack_ns
(which is permitted also via the PR_SET_TIMERSLACK method).

This patch corrects this by checking if p == current before checking the
CAP_SYS_NICE value.

This patch applies on top of my two previous patches currently in -mm

Link: http://lkml.kernel.org/r/1471906870-28624-1-git-send-email-john.stultz@linaro.org
Signed-off-by: John Stultz <john.stultz@linaro.org>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: "Serge E. Hallyn" <serge@hallyn.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Oren Laadan <orenl@cellrox.com>
Cc: Ruchi Kandoi <kandoiruchi@google.com>
Cc: Rom Lemarchand <romlem@android.com>
Cc: Todd Kjos <tkjos@google.com>
Cc: Colin Cross <ccross@android.com>
Cc: Nick Kralevich <nnk@google.com>
Cc: Dmitry Shmidt <dimitrysh@google.com>
Cc: Elliott Hughes <enh@google.com>
Cc: Android Kernel Team <kernel-team@android.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07 18:46:30 -07:00
John Stultz
904763e1fb proc: add LSM hook checks to /proc/<tid>/timerslack_ns
As requested, this patch checks the existing LSM hooks
task_getscheduler/task_setscheduler when reading or modifying the task's
timerslack value.

Previous versions added new get/settimerslack LSM hooks, but since they
checked the same PROCESS__SET/GETSCHED values as existing hooks, it was
suggested we just use the existing ones.

Link: http://lkml.kernel.org/r/1469132667-17377-2-git-send-email-john.stultz@linaro.org
Signed-off-by: John Stultz <john.stultz@linaro.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: "Serge E. Hallyn" <serge@hallyn.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Oren Laadan <orenl@cellrox.com>
Cc: Ruchi Kandoi <kandoiruchi@google.com>
Cc: Rom Lemarchand <romlem@android.com>
Cc: Todd Kjos <tkjos@google.com>
Cc: Colin Cross <ccross@android.com>
Cc: Nick Kralevich <nnk@google.com>
Cc: Dmitry Shmidt <dimitrysh@google.com>
Cc: Elliott Hughes <enh@google.com>
Cc: James Morris <jmorris@namei.org>
Cc: Android Kernel Team <kernel-team@android.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07 18:46:30 -07:00
John Stultz
7abbaf9404 proc: relax /proc/<tid>/timerslack_ns capability requirements
When an interface to allow a task to change another tasks timerslack was
first proposed, it was suggested that something greater then
CAP_SYS_NICE would be needed, as a task could be delayed further then
what normally could be done with nice adjustments.

So CAP_SYS_PTRACE was adopted instead for what became the
/proc/<tid>/timerslack_ns interface.  However, for Android (where this
feature originates), giving the system_server CAP_SYS_PTRACE would allow
it to observe and modify all tasks memory.  This is considered too high
a privilege level for only needing to change the timerslack.

After some discussion, it was realized that a CAP_SYS_NICE process can
set a task as SCHED_FIFO, so they could fork some spinning processes and
set them all SCHED_FIFO 99, in effect delaying all other tasks for an
infinite amount of time.

So as a CAP_SYS_NICE task can already cause trouble for other tasks,
using it as a required capability for accessing and modifying
/proc/<tid>/timerslack_ns seems sufficient.

Thus, this patch loosens the capability requirements to CAP_SYS_NICE and
removes CAP_SYS_PTRACE, simplifying some of the code flow as well.

This is technically an ABI change, but as the feature just landed in
4.6, I suspect no one is yet using it.

Link: http://lkml.kernel.org/r/1469132667-17377-1-git-send-email-john.stultz@linaro.org
Signed-off-by: John Stultz <john.stultz@linaro.org>
Reviewed-by: Nick Kralevich <nnk@google.com>
Acked-by: Serge Hallyn <serge@hallyn.com>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: "Serge E. Hallyn" <serge@hallyn.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Oren Laadan <orenl@cellrox.com>
Cc: Ruchi Kandoi <kandoiruchi@google.com>
Cc: Rom Lemarchand <romlem@android.com>
Cc: Todd Kjos <tkjos@google.com>
Cc: Colin Cross <ccross@android.com>
Cc: Nick Kralevich <nnk@google.com>
Cc: Dmitry Shmidt <dimitrysh@google.com>
Cc: Elliott Hughes <enh@google.com>
Cc: Android Kernel Team <kernel-team@android.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07 18:46:30 -07:00
Al Viro
c531716785 proc: switch auxv to use of __mem_open()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-10-05 18:43:43 -04:00
Deepa Dinamani
078cd8279e fs: Replace CURRENT_TIME with current_time() for inode timestamps
CURRENT_TIME macro is not appropriate for filesystems as it
doesn't use the right granularity for filesystem timestamps.
Use current_time() instead.

CURRENT_TIME is also not y2038 safe.

This is also in preparation for the patch that transitions
vfs timestamps to use 64 bit time and hence make them
y2038 safe. As part of the effort current_time() will be
extended to do range checks. Hence, it is necessary for all
file system timestamps to use current_time(). Also,
current_time() will be transitioned along with vfs to be
y2038 safe.

Note that whenever a single call to current_time() is used
to change timestamps in different inodes, it is because they
share the same time granularity.

Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Felipe Balbi <balbi@kernel.org>
Acked-by: Steven Whitehouse <swhiteho@redhat.com>
Acked-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Acked-by: David Sterba <dsterba@suse.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-09-27 21:06:21 -04:00
Jan Kara
31051c85b5 fs: Give dentry to inode_change_ok() instead of inode
inode_change_ok() will be resposible for clearing capabilities and IMA
extended attributes and as such will need dentry. Give it as an argument
to inode_change_ok() instead of an inode. Also rename inode_change_ok()
to setattr_prepare() to better relect that it does also some
modifications in addition to checks.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
2016-09-22 10:56:19 +02:00
Ingo Molnar
d4b80afbba Merge branch 'linus' into x86/asm, to pick up recent fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-15 08:24:53 +02:00
Linus Torvalds
511a8cdb65 Merge branch 'stable-4.8' of git://git.infradead.org/users/pcmoore/audit
Pull audit fixes from Paul Moore:
 "Two small patches to fix some bugs with the audit-by-executable
  functionality we introduced back in v4.3 (both patches are marked
  for the stable folks)"

* 'stable-4.8' of git://git.infradead.org/users/pcmoore/audit:
  audit: fix exe_file access in audit_exe_compare
  mm: introduce get_task_exe_file
2016-09-01 15:55:56 -07:00
Mateusz Guzik
cd81a9170e mm: introduce get_task_exe_file
For more convenient access if one has a pointer to the task.

As a minor nit take advantage of the fact that only task lock + rcu are
needed to safely grab ->exe_file. This saves mm refcount dance.

Use the helper in proc_exe_link.

Signed-off-by: Mateusz Guzik <mguzik@redhat.com>
Acked-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Acked-by: Richard Guy Briggs <rgb@redhat.com>
Cc: <stable@vger.kernel.org> # 4.3.x
Signed-off-by: Paul Moore <paul@paul-moore.com>
2016-08-31 16:11:20 -04:00
Josh Poimboeuf
8b927d7341 proc: Fix return address printk conversion specifer in /proc/<pid>/stack
When printing call return addresses found on a stack, /proc/<pid>/stack
can sometimes give a confusing result.  If the call instruction was the
last instruction in the function (which can happen when calling a
noreturn function), '%pS' will incorrectly display the name of the
function which happens to be next in the object code, rather than the
name of the actual calling function.

Use '%pB' instead, which was created for this exact purpose.

Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Byungchul Park <byungchul.park@lge.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Nilay Vaish <nilayvaish@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/47ad2821e5ebdbed1fbf83fb85424ae4fbdf8b6e.1471535549.git.jpoimboe@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-08-18 18:41:32 +02:00
Oleg Nesterov
ef419398b6 proc_oom_score: remove tasklist_lock and pid_alive()
This was needed before to ensure that ->signal != 0 and do_each_thread()
is safe, see commit b95c35e76b ("oom: fix the unsafe usage of
badness() in proc_oom_score()") for details.

Today tsk->signal can't go away and for_each_thread(tsk) is always safe.

Link: http://lkml.kernel.org/r/20160608211921.GA15508@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-08-02 17:31:41 -04:00
Michal Hocko
44a70adec9 mm, oom_adj: make sure processes sharing mm have same view of oom_score_adj
oom_score_adj is shared for the thread groups (via struct signal) but this
is not sufficient to cover processes sharing mm (CLONE_VM without
CLONE_SIGHAND) and so we can easily end up in a situation when some
processes update their oom_score_adj and confuse the oom killer.  In the
worst case some of those processes might hide from the oom killer
altogether via OOM_SCORE_ADJ_MIN while others are eligible.  OOM killer
would then pick up those eligible but won't be allowed to kill others
sharing the same mm so the mm wouldn't release the mm and so the memory.

It would be ideal to have the oom_score_adj per mm_struct because that is
the natural entity OOM killer considers.  But this will not work because
some programs are doing

	vfork()
	set_oom_adj()
	exec()

We can achieve the same though.  oom_score_adj write handler can set the
oom_score_adj for all processes sharing the same mm if the task is not in
the middle of vfork.  As a result all the processes will share the same
oom_score_adj.  The current implementation is rather pessimistic and
checks all the existing processes by default if there is more than 1
holder of the mm but we do not have any reliable way to check for external
users yet.

Link: http://lkml.kernel.org/r/1466426628-15074-5-git-send-email-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28 16:07:41 -07:00
Michal Hocko
1d5f0acbc6 proc, oom_adj: extract oom_score_adj setting into a helper
Currently we have two proc interfaces to set oom_score_adj.  The legacy
/proc/<pid>/oom_adj and /proc/<pid>/oom_score_adj which both have their
specific handlers.  Big part of the logic is duplicated so extract the
common code into __set_oom_adj helper.  Legacy knob still expects some
details slightly different so make sure those are handled same way - e.g.
the legacy mode ignores oom_score_adj_min and it warns about the usage.

This patch shouldn't introduce any functional changes.

Link: http://lkml.kernel.org/r/1466426628-15074-4-git-send-email-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28 16:07:41 -07:00
Michal Hocko
f913da596a proc, oom: drop bogus sighand lock
Oleg has pointed out that can simplify both oom_adj_{read,write} and
oom_score_adj_{read,write} even further and drop the sighand lock.  The
main purpose of the lock was to protect p->signal from going away but this
will not happen since ea6d290ca3 ("signals: make task_struct->signal
immutable/refcountable").

The other role of the lock was to synchronize different writers,
especially those with CAP_SYS_RESOURCE.  Introduce a mutex for this
purpose.  Later patches will need this lock anyway.

Suggested-by: Oleg Nesterov <oleg@redhat.com>
Link: http://lkml.kernel.org/r/1466426628-15074-3-git-send-email-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28 16:07:41 -07:00
Michal Hocko
d49fbf766d proc, oom: drop bogus task_lock and mm check
Series "Handle oom bypass more gracefully", V5

The following 10 patches should put some order to very rare cases of mm
shared between processes and make the paths which bypass the oom killer
oom reapable and therefore much more reliable finally.  Even though mm
shared outside of thread group is rare (either vforked tasks for a short
period, use_mm by kernel threads or exotic thread model of
clone(CLONE_VM) without CLONE_SIGHAND) it is better to cover them.  Not
only it makes the current oom killer logic quite hard to follow and
reason about it can lead to weird corner cases.  E.g.  it is possible to
select an oom victim which shares the mm with unkillable process or
bypass the oom killer even when other processes sharing the mm are still
alive and other weird cases.

Patch 1 drops bogus task_lock and mm check from oom_{score_}adj_write.
This can be considered a bug fix with a low impact as nobody has noticed
for years.

Patch 2 drops sighand lock because it is not needed anymore as pointed
by Oleg.

Patch 3 is a clean up of oom_score_adj handling and a preparatory work
for later patches.

Patch 4 enforces oom_adj_score to be consistent between processes
sharing the mm to behave consistently with the regular thread groups.
This can be considered a user visible behavior change because one thread
group updating oom_score_adj will affect others which share the same mm
via clone(CLONE_VM).  I argue that this should be acceptable because we
already have the same behavior for threads in the same thread group and
sharing the mm without signal struct is just a different model of
threading.  This is probably the most controversial part of the series,
I would like to find some consensus here.  There were some suggestions
to hook some counter/oom_score_adj into the mm_struct but I feel that
this is not necessary right now and we can rely on proc handler +
oom_kill_process to DTRT.  I can be convinced otherwise but I strongly
think that whatever we do the userspace has to have a way to see the
current oom priority as consistently as possible.

Patch 5 makes sure that no vforked task is selected if it is sharing the
mm with oom unkillable task.

Patch 6 ensures that all user tasks sharing the mm are killed which in
turn makes sure that all oom victims are oom reapable.

Patch 7 guarantees that task_will_free_mem will always imply reapable
bypass of the oom killer.

Patch 8 is new in this version and it addresses an issue pointed out by
0-day OOM report where an oom victim was reaped several times.

Patch 9 puts an upper bound on how many times oom_reaper tries to reap a
task and hides it from the oom killer to move on when no progress can be
made.  This will give an upper bound to how long an oom_reapable task
can block the oom killer from selecting another victim if the oom_reaper
is not able to reap the victim.

Patch 10 tries to plug the (hopefully) last hole when we can still lock
up when the oom victim is shared with oom unkillable tasks (kthreads and
global init).  We just try to be best effort in that case and rather
fallback to kill something else than risk a lockup.

This patch (of 10):

Both oom_adj_write and oom_score_adj_write are using task_lock, check for
task->mm and fail if it is NULL.  This is not needed because the
oom_score_adj is per signal struct so we do not need mm at all.  The code
has been introduced by 3d5992d2ac ("oom: add per-mm oom disable count")
but we do not do per-mm oom disable since c9f01245b6 ("oom: remove
oom_disable_count").

The task->mm check is even not correct because the current thread might
have exited but the thread group might be still alive - e.g.  thread group
leader would lead that echo $VAL > /proc/pid/oom_score_adj would always
fail with EINVAL while /proc/pid/task/$other_tid/oom_score_adj would
succeed.  This is unexpected at best.

Remove the lock along with the check to fix the unexpected behavior and
also because there is not real need for the lock in the first place.

Link: http://lkml.kernel.org/r/1466426628-15074-2-git-send-email-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28 16:07:41 -07:00
Janis Danisevskis
1b3044e39a procfs: fix pthread cross-thread naming if !PR_DUMPABLE
The PR_DUMPABLE flag causes the pid related paths of the proc file
system to be owned by ROOT.

The implementation of pthread_set/getname_np however needs access to
/proc/<pid>/task/<tid>/comm.  If PR_DUMPABLE is false this
implementation is locked out.

This patch installs a special permission function for the file "comm"
that grants read and write access to all threads of the same group
regardless of the ownership of the inode.  For all other threads the
function falls back to the generic inode permission check.

[akpm@linux-foundation.org: fix spello in comment]
Signed-off-by: Janis Danisevskis <jdanis@google.com>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Colin Ian King <colin.king@canonical.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Minfei Huang <mnfhuang@gmail.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Calvin Owens <calvinowens@fb.com>
Cc: Jann Horn <jann@thejh.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-20 17:58:30 -07:00
Linus Torvalds
7f427d3a60 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull parallel filesystem directory handling update from Al Viro.

This is the main parallel directory work by Al that makes the vfs layer
able to do lookup and readdir in parallel within a single directory.
That's a big change, since this used to be all protected by the
directory inode mutex.

The inode mutex is replaced by an rwsem, and serialization of lookups of
a single name is done by a "in-progress" dentry marker.

The series begins with xattr cleanups, and then ends with switching
filesystems over to actually doing the readdir in parallel (switching to
the "iterate_shared()" that only takes the read lock).

A more detailed explanation of the process from Al Viro:
 "The xattr work starts with some acl fixes, then switches ->getxattr to
  passing inode and dentry separately.  This is the point where the
  things start to get tricky - that got merged into the very beginning
  of the -rc3-based #work.lookups, to allow untangling the
  security_d_instantiate() mess.  The xattr work itself proceeds to
  switch a lot of filesystems to generic_...xattr(); no complications
  there.

  After that initial xattr work, the series then does the following:

   - untangle security_d_instantiate()

   - convert a bunch of open-coded lookup_one_len_unlocked() to calls of
     that thing; one such place (in overlayfs) actually yields a trivial
     conflict with overlayfs fixes later in the cycle - overlayfs ended
     up switching to a variant of lookup_one_len_unlocked() sans the
     permission checks.  I would've dropped that commit (it gets
     overridden on merge from #ovl-fixes in #for-next; proper resolution
     is to use the variant in mainline fs/overlayfs/super.c), but I
     didn't want to rebase the damn thing - it was fairly late in the
     cycle...

   - some filesystems had managed to depend on lookup/lookup exclusion
     for *fs-internal* data structures in a way that would break if we
     relaxed the VFS exclusion.  Fixing hadn't been hard, fortunately.

   - core of that series - parallel lookup machinery, replacing
     ->i_mutex with rwsem, making lookup_slow() take it only shared.  At
     that point lookups happen in parallel; lookups on the same name
     wait for the in-progress one to be done with that dentry.

     Surprisingly little code, at that - almost all of it is in
     fs/dcache.c, with fs/namei.c changes limited to lookup_slow() -
     making it use the new primitive and actually switching to locking
     shared.

   - parallel readdir stuff - first of all, we provide the exclusion on
     per-struct file basis, same as we do for read() vs lseek() for
     regular files.  That takes care of most of the needed exclusion in
     readdir/readdir; however, these guys are trickier than lookups, so
     I went for switching them one-by-one.  To do that, a new method
     '->iterate_shared()' is added and filesystems are switched to it
     as they are either confirmed to be OK with shared lock on directory
     or fixed to be OK with that.  I hope to kill the original method
     come next cycle (almost all in-tree filesystems are switched
     already), but it's still not quite finished.

   - several filesystems get switched to parallel readdir.  The
     interesting part here is dealing with dcache preseeding by readdir;
     that needs minor adjustment to be safe with directory locked only
     shared.

     Most of the filesystems doing that got switched to in those
     commits.  Important exception: NFS.  Turns out that NFS folks, with
     their, er, insistence on VFS getting the fuck out of the way of the
     Smart Filesystem Code That Knows How And What To Lock(tm) have
     grown the locking of their own.  They had their own homegrown
     rwsem, with lookup/readdir/atomic_open being *writers* (sillyunlink
     is the reader there).  Of course, with VFS getting the fuck out of
     the way, as requested, the actual smarts of the smart filesystem
     code etc. had become exposed...

   - do_last/lookup_open/atomic_open cleanups.  As the result, open()
     without O_CREAT locks the directory only shared.  Including the
     ->atomic_open() case.  Backmerge from #for-linus in the middle of
     that - atomic_open() fix got brought in.

   - then comes NFS switch to saner (VFS-based ;-) locking, killing the
     homegrown "lookup and readdir are writers" kinda-sorta rwsem.  All
     exclusion for sillyunlink/lookup is done by the parallel lookups
     mechanism.  Exclusion between sillyunlink and rmdir is a real rwsem
     now - rmdir being the writer.

     Result: NFS lookups/readdirs/O_CREAT-less opens happen in parallel
     now.

   - the rest of the series consists of switching a lot of filesystems
     to parallel readdir; in a lot of cases ->llseek() gets simplified
     as well.  One backmerge in there (again, #for-linus - rockridge
     fix)"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (74 commits)
  ext4: switch to ->iterate_shared()
  hfs: switch to ->iterate_shared()
  hfsplus: switch to ->iterate_shared()
  hostfs: switch to ->iterate_shared()
  hpfs: switch to ->iterate_shared()
  hpfs: handle allocation failures in hpfs_add_pos()
  gfs2: switch to ->iterate_shared()
  f2fs: switch to ->iterate_shared()
  afs: switch to ->iterate_shared()
  befs: switch to ->iterate_shared()
  befs: constify stuff a bit
  isofs: switch to ->iterate_shared()
  get_acorn_filename(): deobfuscate a bit
  btrfs: switch to ->iterate_shared()
  logfs: no need to lock directory in lseek
  switch ecryptfs to ->iterate_shared
  9p: switch to ->iterate_shared()
  fat: switch to ->iterate_shared()
  romfs, squashfs: switch to ->iterate_shared()
  more trivial ->iterate_shared conversions
  ...
2016-05-17 11:01:31 -07:00
Al Viro
0e0162bb8c Merge branch 'ovl-fixes' into for-linus
Backmerge to resolve a conflict in ovl_lookup_real();
"ovl_lookup_real(): use lookup_one_len_unlocked()" instead,
but it was too late in the cycle to rebase.
2016-05-17 02:17:59 -04:00
Robin Humble
1e92a61c4c Revert "proc/base: make prompt shell start from new line after executing "cat /proc/$pid/wchan""
This reverts the 4.6-rc1 commit 7e2bc81da3 ("proc/base: make prompt
shell start from new line after executing "cat /proc/$pid/wchan")
because it breaks /proc/$PID/whcan formatting in ps and top.

Revert also because the patch is inconsistent - it adds a newline at the
end of only the '0' wchan, and does not add a newline when
/proc/$PID/wchan contains a symbol name.

eg.
$ ps -eo pid,stat,wchan,comm
PID STAT WCHAN  COMMAND
...
1189 S    -      dbus-launch
1190 Ssl  0
dbus-daemon
1198 Sl   0
lightdm
1299 Ss   ep_pol systemd
1301 S    -      (sd-pam)
1304 Ss   wait   sh

Signed-off-by: Robin Humble <plaguedbypenguins@gmail.com>
Cc: Minfei Huang <mnfhuang@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-09 17:40:59 -07:00
Mathias Krause
8148a73c99 proc: prevent accessing /proc/<PID>/environ until it's ready
If /proc/<PID>/environ gets read before the envp[] array is fully set up
in create_{aout,elf,elf_fdpic,flat}_tables(), we might end up trying to
read more bytes than are actually written, as env_start will already be
set but env_end will still be zero, making the range calculation
underflow, allowing to read beyond the end of what has been written.

Fix this as it is done for /proc/<PID>/cmdline by testing env_end for
zero.  It is, apparently, intentionally set last in create_*_tables().

This bug was found by the PaX size_overflow plugin that detected the
arithmetic underflow of 'this_len = env_end - (env_start + src)' when
env_end is still zero.

The expected consequence is that userland trying to access
/proc/<PID>/environ of a not yet fully set up process may get
inconsistent data as we're in the middle of copying in the environment
variables.

Fixes: https://forums.grsecurity.net/viewtopic.php?f=3&t=4363
Fixes: https://bugzilla.kernel.org/show_bug.cgi?id=116461
Signed-off-by: Mathias Krause <minipli@googlemail.com>
Cc: Emese Revfy <re.emese@gmail.com>
Cc: Pax Team <pageexec@freemail.hu>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Mateusz Guzik <mguzik@redhat.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Jarod Wilson <jarod@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-05 17:38:53 -07:00
Al Viro
f50752eaa0 switch all procfs directories ->iterate_shared()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-02 19:49:30 -04:00
Al Viro
3781764b5c proc_fill_cache(): switch to d_alloc_parallel()
... making it usable with directory locked shared

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-02 19:49:29 -04:00
Minfei Huang
7e2bc81da3 proc/base: make prompt shell start from new line after executing "cat /proc/$pid/wchan"
It is not elegant that prompt shell does not start from new line after
executing "cat /proc/$pid/wchan".  Make prompt shell start from new
line.

Signed-off-by: Minfei Huang <mnfhuang@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-17 15:09:34 -07:00
Eric Engestrom
b5946beaa9 procfs: add conditional compilation check
`proc_timers_operations` is only used when CONFIG_CHECKPOINT_RESTORE is
enabled.

Signed-off-by: Eric Engestrom <eric.engestrom@imgtec.com>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-17 15:09:34 -07:00
John Stultz
5de23d435e proc: add /proc/<pid>/timerslack_ns interface
This patch provides a proc/PID/timerslack_ns interface which exposes a
task's timerslack value in nanoseconds and allows it to be changed.

This allows power/performance management software to set timer slack for
other threads according to its policy for the thread (such as when the
thread is designated foreground vs.  background activity)

If the value written is non-zero, slack is set to that value.  Otherwise
sets it to the default for the thread.

This interface checks that the calling task has permissions to to use
PTRACE_MODE_ATTACH_FSCREDS on the target task, so that we can ensure
arbitrary apps do not change the timer slack for other apps.

Signed-off-by: John Stultz <john.stultz@linaro.org>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Oren Laadan <orenl@cellrox.com>
Cc: Ruchi Kandoi <kandoiruchi@google.com>
Cc: Rom Lemarchand <romlem@android.com>
Cc: Android Kernel Team <kernel-team@android.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-17 15:09:34 -07:00
Mateusz Guzik
a3b609ef9f proc read mm's {arg,env}_{start,end} with mmap semaphore taken.
Only functions doing more than one read are modified.  Consumeres
happened to deal with possibly changing data, but it does not seem like
a good thing to rely on.

Signed-off-by: Mateusz Guzik <mguzik@redhat.com>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Jarod Wilson <jarod@redhat.com>
Cc: Jan Stancek <jstancek@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Anshuman Khandual <anshuman.linux@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-20 17:09:18 -08:00
Jann Horn
caaee6234d ptrace: use fsuid, fsgid, effective creds for fs access checks
By checking the effective credentials instead of the real UID / permitted
capabilities, ensure that the calling process actually intended to use its
credentials.

To ensure that all ptrace checks use the correct caller credentials (e.g.
in case out-of-tree code or newly added code omits the PTRACE_MODE_*CREDS
flag), use two new flags and require one of them to be set.

The problem was that when a privileged task had temporarily dropped its
privileges, e.g.  by calling setreuid(0, user_uid), with the intent to
perform following syscalls with the credentials of a user, it still passed
ptrace access checks that the user would not be able to pass.

While an attacker should not be able to convince the privileged task to
perform a ptrace() syscall, this is a problem because the ptrace access
check is reused for things in procfs.

In particular, the following somewhat interesting procfs entries only rely
on ptrace access checks:

 /proc/$pid/stat - uses the check for determining whether pointers
     should be visible, useful for bypassing ASLR
 /proc/$pid/maps - also useful for bypassing ASLR
 /proc/$pid/cwd - useful for gaining access to restricted
     directories that contain files with lax permissions, e.g. in
     this scenario:
     lrwxrwxrwx root root /proc/13020/cwd -> /root/foobar
     drwx------ root root /root
     drwxr-xr-x root root /root/foobar
     -rw-r--r-- root root /root/foobar/secret

Therefore, on a system where a root-owned mode 6755 binary changes its
effective credentials as described and then dumps a user-specified file,
this could be used by an attacker to reveal the memory layout of root's
processes or reveal the contents of files he is not allowed to access
(through /proc/$pid/cwd).

[akpm@linux-foundation.org: fix warning]
Signed-off-by: Jann Horn <jann@thejh.net>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Casey Schaufler <casey@schaufler-ca.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Morris <james.l.morris@oracle.com>
Cc: "Serge E. Hallyn" <serge.hallyn@ubuntu.com>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Willy Tarreau <w@1wt.eu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-20 17:09:18 -08:00