commit 94f0b2d4a1 upstream.
Commit 591a22c14d ("proc: Track /proc/$pid/attr/ opener mm_struct") we
started using __mem_open() to track the mm_struct at open-time, so that
we could then check it for writes.
But that also ended up making the permission checks at open time much
stricter - and not just for writes, but for reads too. And that in turn
caused a regression for at least Fedora 29, where NIC interfaces fail to
start when using NetworkManager.
Since only the write side wanted the mm_struct test, ignore any failures
by __mem_open() at open time, leaving reads unaffected. The write()
time verification of the mm_struct pointer will then catch the failure
case because a NULL pointer will not match a valid 'current->mm'.
Link: https://lore.kernel.org/netdev/YMjTlp2FSJYvoyFa@unreal/
Fixes: 591a22c14d ("proc: Track /proc/$pid/attr/ opener mm_struct")
Reported-and-tested-by: Leon Romanovsky <leon@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Andrea Righi <andrea.righi@canonical.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 591a22c14d upstream.
Commit bfb819ea20 ("proc: Check /proc/$pid/attr/ writes against file opener")
tried to make sure that there could not be a confusion between the opener of
a /proc/$pid/attr/ file and the writer. It used struct cred to make sure
the privileges didn't change. However, there were existing cases where a more
privileged thread was passing the opened fd to a differently privileged thread
(during container setup). Instead, use mm_struct to track whether the opener
and writer are still the same process. (This is what several other proc files
already do, though for different reasons.)
Reported-by: Christian Brauner <christian.brauner@ubuntu.com>
Reported-by: Andrea Righi <andrea.righi@canonical.com>
Tested-by: Andrea Righi <andrea.righi@canonical.com>
Fixes: bfb819ea20 ("proc: Check /proc/$pid/attr/ writes against file opener")
Cc: stable@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit f8a00cef17 upstream.
Currently, you can use /proc/self/task/*/stack to cause a stack walk on
a task you control while it is running on another CPU. That means that
the stack can change under the stack walker. The stack walker does
have guards against going completely off the rails and into random
kernel memory, but it can interpret random data from your kernel stack
as instruction pointers and stack pointers. This can cause exposure of
kernel stack contents to userspace.
Restrict the ability to inspect kernel stacks of arbitrary tasks to root
in order to prevent a local attacker from exploiting racy stack unwinding
to leak kernel task stack contents. See the added comment for a longer
rationale.
There don't seem to be any users of this userspace API that can't
gracefully bail out if reading from the file fails. Therefore, I believe
that this change is unlikely to break things. In the case that this patch
does end up needing a revert, the next-best solution might be to fake a
single-entry stack based on wchan.
Link: http://lkml.kernel.org/r/20180927153316.200286-1-jannh@google.com
Fixes: 2ec220e27f ("proc: add /proc/*/stack")
Signed-off-by: Jann Horn <jannh@google.com>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Ken Chen <kenchen@google.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Laura Abbott <labbott@redhat.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Changes in 4.9.104
MIPS: c-r4k: Fix data corruption related to cache coherence
MIPS: ptrace: Expose FIR register through FP regset
MIPS: Fix ptrace(2) PTRACE_PEEKUSR and PTRACE_POKEUSR accesses to o32 FGRs
KVM: Fix spelling mistake: "cop_unsuable" -> "cop_unusable"
affs_lookup(): close a race with affs_remove_link()
aio: fix io_destroy(2) vs. lookup_ioctx() race
ALSA: timer: Fix pause event notification
do d_instantiate/unlock_new_inode combinations safely
mmc: sdhci-iproc: remove hard coded mmc cap 1.8v
mmc: sdhci-iproc: fix 32bit writes for TRANSFER_MODE register
libata: Blacklist some Sandisk SSDs for NCQ
libata: blacklist Micron 500IT SSD with MU01 firmware
xen-swiotlb: fix the check condition for xen_swiotlb_free_coherent
drm/vmwgfx: Fix 32-bit VMW_PORT_HB_[IN|OUT] macros
IB/hfi1: Use after free race condition in send context error path
Revert "ipc/shm: Fix shmat mmap nil-page protection"
ipc/shm: fix shmat() nil address after round-down when remapping
kasan: fix memory hotplug during boot
kernel/sys.c: fix potential Spectre v1 issue
kernel/signal.c: avoid undefined behaviour in kill_something_info
KVM/VMX: Expose SSBD properly to guests
KVM: s390: vsie: fix < 8k check for the itdba
KVM: x86: Update cpuid properly when CR4.OSXAVE or CR4.PKE is changed
kvm: x86: IA32_ARCH_CAPABILITIES is always supported
firewire-ohci: work around oversized DMA reads on JMicron controllers
x86/tsc: Allow TSC calibration without PIT
NFSv4: always set NFS_LOCK_LOST when a lock is lost.
ALSA: hda - Use IS_REACHABLE() for dependency on input
kvm: x86: fix KVM_XEN_HVM_CONFIG ioctl
netfilter: ipv6: nf_defrag: Pass on packets to stack per RFC2460
tracing/hrtimer: Fix tracing bugs by taking all clock bases and modes into account
PCI: Add function 1 DMA alias quirk for Marvell 9128
Input: psmouse - fix Synaptics detection when protocol is disabled
i40iw: Zero-out consumer key on allocate stag for FMR
tools lib traceevent: Simplify pointer print logic and fix %pF
perf callchain: Fix attr.sample_max_stack setting
tools lib traceevent: Fix get_field_str() for dynamic strings
perf record: Fix failed memory allocation for get_cpuid_str
iommu/vt-d: Use domain instead of cache fetching
dm thin: fix documentation relative to low water mark threshold
net: stmmac: dwmac-meson8b: fix setting the RGMII TX clock on Meson8b
net: stmmac: dwmac-meson8b: propagate rate changes to the parent clock
nfs: Do not convert nfs_idmap_cache_timeout to jiffies
watchdog: sp5100_tco: Fix watchdog disable bit
kconfig: Don't leak main menus during parsing
kconfig: Fix automatic menu creation mem leak
kconfig: Fix expr_free() E_NOT leak
mac80211_hwsim: fix possible memory leak in hwsim_new_radio_nl()
ipmi/powernv: Fix error return code in ipmi_powernv_probe()
Btrfs: set plug for fsync
btrfs: Fix out of bounds access in btrfs_search_slot
Btrfs: fix scrub to repair raid6 corruption
btrfs: fail mount when sb flag is not in BTRFS_SUPER_FLAG_SUPP
HID: roccat: prevent an out of bounds read in kovaplus_profile_activated()
fm10k: fix "failed to kill vid" message for VF
device property: Define type of PROPERTY_ENRTY_*() macros
jffs2: Fix use-after-free bug in jffs2_iget()'s error handling path
powerpc/numa: Use ibm,max-associativity-domains to discover possible nodes
powerpc/numa: Ensure nodes initialized for hotplug
RDMA/mlx5: Avoid memory leak in case of XRCD dealloc failure
ntb_transport: Fix bug with max_mw_size parameter
gianfar: prevent integer wrapping in the rx handler
tcp_nv: fix potential integer overflow in tcpnv_acked
kvm: Map PFN-type memory regions as writable (if possible)
ocfs2: return -EROFS to mount.ocfs2 if inode block is invalid
ocfs2/acl: use 'ip_xattr_sem' to protect getting extended attribute
ocfs2: return error when we attempt to access a dirty bh in jbd2
mm/mempolicy: fix the check of nodemask from user
mm/mempolicy: add nodes_empty check in SYSC_migrate_pages
asm-generic: provide generic_pmdp_establish()
sparc64: update pmdp_invalidate() to return old pmd value
mm: thp: use down_read_trylock() in khugepaged to avoid long block
mm: pin address_space before dereferencing it while isolating an LRU page
mm/fadvise: discard partial page if endbyte is also EOF
openvswitch: Remove padding from packet before L3+ conntrack processing
IB/ipoib: Fix for potential no-carrier state
drm/nouveau/pmu/fuc: don't use movw directly anymore
netfilter: ipv6: nf_defrag: Kill frag queue on RFC2460 failure
x86/power: Fix swsusp_arch_resume prototype
firmware: dmi_scan: Fix handling of empty DMI strings
ACPI: processor_perflib: Do not send _PPC change notification if not ready
ACPI / scan: Use acpi_bus_get_status() to initialize ACPI_TYPE_DEVICE devs
bpf: fix selftests/bpf test_kmod.sh failure when CONFIG_BPF_JIT_ALWAYS_ON=y
MIPS: generic: Fix machine compatible matching
MIPS: TXx9: use IS_BUILTIN() for CONFIG_LEDS_CLASS
xen-netfront: Fix race between device setup and open
xen/grant-table: Use put_page instead of free_page
RDS: IB: Fix null pointer issue
arm64: spinlock: Fix theoretical trylock() A-B-A with LSE atomics
proc: fix /proc/*/map_files lookup
cifs: silence compiler warnings showing up with gcc-8.0.0
bcache: properly set task state in bch_writeback_thread()
bcache: fix for allocator and register thread race
bcache: fix for data collapse after re-attaching an attached device
bcache: return attach error when no cache set exist
tools/libbpf: handle issues with bpf ELF objects containing .eh_frames
bpf: fix rlimit in reuseport net selftest
vfs/proc/kcore, x86/mm/kcore: Fix SMAP fault when dumping vsyscall user page
locking/qspinlock: Ensure node->count is updated before initialising node
irqchip/gic-v3: Ignore disabled ITS nodes
cpumask: Make for_each_cpu_wrap() available on UP as well
irqchip/gic-v3: Change pr_debug message to pr_devel
ARC: Fix malformed ARC_EMUL_UNALIGNED default
ptr_ring: prevent integer overflow when calculating size
libata: Fix compile warning with ATA_DEBUG enabled
selftests: pstore: Adding config fragment CONFIG_PSTORE_RAM=m
selftests: memfd: add config fragment for fuse
ARM: OMAP2+: timer: fix a kmemleak caused in omap_get_timer_dt
ARM: OMAP3: Fix prm wake interrupt for resume
ARM: OMAP1: clock: Fix debugfs_create_*() usage
ibmvnic: Free RX socket buffer in case of adapter error
iwlwifi: mvm: fix security bug in PN checking
iwlwifi: mvm: always init rs with 20mhz bandwidth rates
NFC: llcp: Limit size of SDP URI
rxrpc: Work around usercopy check
mac80211: round IEEE80211_TX_STATUS_HEADROOM up to multiple of 4
mac80211: fix a possible leak of station stats
mac80211: fix calling sleeping function in atomic context
mac80211: Do not disconnect on invalid operating class
md raid10: fix NULL deference in handle_write_completed()
drm/exynos: g2d: use monotonic timestamps
drm/exynos: fix comparison to bitshift when dealing with a mask
locking/xchg/alpha: Add unconditional memory barrier to cmpxchg()
md: raid5: avoid string overflow warning
kernel/relay.c: limit kmalloc size to KMALLOC_MAX_SIZE
powerpc/bpf/jit: Fix 32-bit JIT for seccomp_data access
s390/cio: fix ccw_device_start_timeout API
s390/cio: fix return code after missing interrupt
s390/cio: clear timer when terminating driver I/O
PKCS#7: fix direct verification of SignerInfo signature
ARM: OMAP: Fix dmtimer init for omap1
smsc75xx: fix smsc75xx_set_features()
regulatory: add NUL to request alpha2
integrity/security: fix digsig.c build error with header file
locking/xchg/alpha: Fix xchg() and cmpxchg() memory ordering bugs
x86/topology: Update the 'cpu cores' field in /proc/cpuinfo correctly across CPU hotplug operations
mac80211: drop frames with unexpected DS bits from fast-rx to slow path
arm64: fix unwind_frame() for filtered out fn for function graph tracing
macvlan: fix use-after-free in macvlan_common_newlink()
kvm: fix warning for CONFIG_HAVE_KVM_EVENTFD builds
fs: dcache: Avoid livelock between d_alloc_parallel and __d_add
fs: dcache: Use READ_ONCE when accessing i_dir_seq
md: fix a potential deadlock of raid5/raid10 reshape
md/raid1: fix NULL pointer dereference
batman-adv: fix packet checksum in receive path
batman-adv: invalidate checksum on fragment reassembly
netfilter: ebtables: convert BUG_ONs to WARN_ONs
batman-adv: Ignore invalid batadv_iv_gw during netlink send
batman-adv: Ignore invalid batadv_v_gw during netlink send
batman-adv: Fix netlink dumping of BLA claims
batman-adv: Fix netlink dumping of BLA backbones
nvme-pci: Fix nvme queue cleanup if IRQ setup fails
clocksource/drivers/fsl_ftm_timer: Fix error return checking
ceph: fix dentry leak when failing to init debugfs
ARM: orion5x: Revert commit 4904dbda41.
qrtr: add MODULE_ALIAS macro to smd
r8152: fix tx packets accounting
virtio-gpu: fix ioctl and expose the fixed status to userspace.
dmaengine: rcar-dmac: fix max_chunk_size for R-Car Gen3
bcache: fix kcrashes with fio in RAID5 backend dev
ip6_tunnel: fix IFLA_MTU ignored on NEWLINK
sit: fix IFLA_MTU ignored on NEWLINK
ARM: dts: NSP: Fix amount of RAM on BCM958625HR
powerpc/boot: Fix random libfdt related build errors
gianfar: Fix Rx byte accounting for ndev stats
net/tcp/illinois: replace broken algorithm reference link
nvmet: fix PSDT field check in command format
xen/pirq: fix error path cleanup when binding MSIs
drm/sun4i: Fix dclk_set_phase
Btrfs: send, fix issuing write op when processing hole in no data mode
selftests/powerpc: Skip the subpage_prot tests if the syscall is unavailable
KVM: PPC: Book3S HV: Fix VRMA initialization with 2MB or 1GB memory backing
iwlwifi: mvm: fix TX of CCMP 256
watchdog: f71808e_wdt: Fix magic close handling
watchdog: sbsa: use 32-bit read for WCV
batman-adv: Fix multicast packet loss with a single WANT_ALL_IPV4/6 flag
e1000e: Fix check_for_link return value with autoneg off
e1000e: allocate ring descriptors with dma_zalloc_coherent
ia64/err-inject: Use get_user_pages_fast()
RDMA/qedr: Fix kernel panic when running fio over NFSoRDMA
RDMA/qedr: Fix iWARP write and send with immediate
IB/mlx4: Fix corruption of RoCEv2 IPv4 GIDs
IB/mlx4: Include GID type when deleting GIDs from HW table under RoCE
IB/mlx5: Fix an error code in __mlx5_ib_modify_qp()
fbdev: Fixing arbitrary kernel leak in case FBIOGETCMAP_SPARC in sbusfb_ioctl_helper().
fsl/fman: avoid sleeping in atomic context while adding an address
net: qcom/emac: Use proper free methods during TX
net: smsc911x: Fix unload crash when link is up
IB/core: Fix possible crash to access NULL netdev
xen: xenbus: use put_device() instead of kfree()
arm64: Relax ARM_SMCCC_ARCH_WORKAROUND_1 discovery
dmaengine: mv_xor_v2: Fix clock resource by adding a register clock
netfilter: ebtables: fix erroneous reject of last rule
bnxt_en: Check valid VNIC ID in bnxt_hwrm_vnic_set_tpa().
workqueue: use put_device() instead of kfree()
ipv4: lock mtu in fnhe when received PMTU < net.ipv4.route.min_pmtu
sunvnet: does not support GSO for sctp
drm/imx: move arming of the vblank event to atomic_flush
net: Fix vlan untag for bridge and vlan_dev with reorder_hdr off
batman-adv: fix header size check in batadv_dbg_arp()
batman-adv: Fix skbuff rcsum on packet reroute
vti4: Don't count header length twice on tunnel setup
vti4: Don't override MTU passed on link creation via IFLA_MTU
perf/cgroup: Fix child event counting bug
brcmfmac: Fix check for ISO3166 code
kbuild: make scripts/adjust_autoksyms.sh robust against timestamp races
RDMA/ucma: Correct option size check using optlen
RDMA/qedr: fix QP's ack timeout configuration
RDMA/qedr: Fix rc initialization on CNQ allocation failure
mm/mempolicy.c: avoid use uninitialized preferred_node
mm, thp: do not cause memcg oom for thp
selftests: ftrace: Add probe event argument syntax testcase
selftests: ftrace: Add a testcase for string type with kprobe_event
selftests: ftrace: Add a testcase for probepoint
batman-adv: fix multicast-via-unicast transmission with AP isolation
batman-adv: fix packet loss for broadcasted DHCP packets to a server
ARM: 8748/1: mm: Define vdso_start, vdso_end as array
net: qmi_wwan: add BroadMobi BM806U 2020:2033
perf/x86/intel: Fix linear IP of PEBS real_ip on Haswell and later CPUs
llc: properly handle dev_queue_xmit() return value
builddeb: Fix header package regarding dtc source links
mm/kmemleak.c: wait for scan completion before disabling free
net: Fix untag for vlan packets without ethernet header
net: mvneta: fix enable of all initialized RXQs
sh: fix debug trap failure to process signals before return to user
nvme: don't send keep-alives to the discovery controller
x86/pgtable: Don't set huge PUD/PMD on non-leaf entries
x86/mm: Do not forbid _PAGE_RW before init for __ro_after_init
fs/proc/proc_sysctl.c: fix potential page fault while unregistering sysctl table
swap: divide-by-zero when zero length swap file on ssd
sr: get/drop reference to device in revalidate and check_events
Force log to disk before reading the AGF during a fstrim
cpufreq: CPPC: Initialize shared perf capabilities of CPUs
dp83640: Ensure against premature access to PHY registers after reset
mm/ksm: fix interaction with THP
mm: fix races between address_space dereference and free in page_evicatable
Btrfs: bail out on error during replay_dir_deletes
Btrfs: fix NULL pointer dereference in log_dir_items
btrfs: Fix possible softlock on single core machines
ocfs2/dlm: don't handle migrate lockres if already in shutdown
sched/rt: Fix rq->clock_update_flags < RQCF_ACT_SKIP warning
KVM: VMX: raise internal error for exception during invalid protected mode state
fscache: Fix hanging wait on page discarded by writeback
sparc64: Make atomic_xchg() an inline function rather than a macro.
net: bgmac: Fix endian access in bgmac_dma_tx_ring_free()
btrfs: tests/qgroup: Fix wrong tree backref level
Btrfs: fix copy_items() return value when logging an inode
btrfs: fix lockdep splat in btrfs_alloc_subvolume_writers
rxrpc: Fix Tx ring annotation after initial Tx failure
rxrpc: Don't treat call aborts as conn aborts
xen/acpi: off by one in read_acpi_id()
drivers: macintosh: rack-meter: really fix bogus memsets
ACPI: acpi_pad: Fix memory leak in power saving threads
powerpc/mpic: Check if cpu_possible() in mpic_physmask()
m68k: set dma and coherent masks for platform FEC ethernets
parisc/pci: Switch LBA PCI bus from Hard Fail to Soft Fail mode
hwmon: (nct6775) Fix writing pwmX_mode
powerpc/perf: Prevent kernel address leak to userspace via BHRB buffer
powerpc/perf: Fix kernel address leak via sampling registers
tools/thermal: tmon: fix for segfault
selftests: Print the test we're running to /dev/kmsg
net/mlx5: Protect from command bit overflow
ath10k: Fix kernel panic while using worker (ath10k_sta_rc_update_wk)
cxgb4: Setup FW queues before registering netdev
ima: Fallback to the builtin hash algorithm
virtio-net: Fix operstate for virtio when no VIRTIO_NET_F_STATUS
arm: dts: socfpga: fix GIC PPI warning
cpufreq: cppc_cpufreq: Fix cppc_cpufreq_init() failure path
zorro: Set up z->dev.dma_mask for the DMA API
bcache: quit dc->writeback_thread when BCACHE_DEV_DETACHING is set
ACPICA: Events: add a return on failure from acpi_hw_register_read
ACPICA: acpi: acpica: fix acpi operand cache leak in nseval.c
cxgb4: Fix queue free path of ULD drivers
i2c: mv64xxx: Apply errata delay only in standard mode
KVM: lapic: stop advertising DIRECTED_EOI when in-kernel IOAPIC is in use
perf top: Fix top.call-graph config option reading
perf stat: Fix core dump when flag T is used
IB/core: Honor port_num while resolving GID for IB link layer
regulator: gpio: Fix some error handling paths in 'gpio_regulator_probe()'
spi: bcm-qspi: fIX some error handling paths
MIPS: ath79: Fix AR724X_PLL_REG_PCIE_CONFIG offset
PCI: Restore config space on runtime resume despite being unbound
ipmi_ssif: Fix kernel panic at msg_done_handler
powerpc: Add missing prototype for arch_irq_work_raise()
f2fs: fix to check extent cache in f2fs_drop_extent_tree
perf/core: Fix perf_output_read_group()
drm/panel: simple: Fix the bus format for the Ontat panel
hwmon: (pmbus/max8688) Accept negative page register values
hwmon: (pmbus/adm1275) Accept negative page register values
perf/x86/intel: Properly save/restore the PMU state in the NMI handler
cdrom: do not call check_disk_change() inside cdrom_open()
perf/x86/intel: Fix large period handling on Broadwell CPUs
perf/x86/intel: Fix event update for auto-reload
arm64: dts: qcom: Fix SPI5 config on MSM8996
soc: qcom: wcnss_ctrl: Fix increment in NV upload
gfs2: Fix fallocate chunk size
x86/devicetree: Initialize device tree before using it
x86/devicetree: Fix device IRQ settings in DT
ALSA: vmaster: Propagate slave error
dmaengine: pl330: fix a race condition in case of threaded irqs
dmaengine: rcar-dmac: Check the done lists in rcar_dmac_chan_get_residue()
enic: enable rq before updating rq descriptors
hwrng: stm32 - add reset during probe
dmaengine: qcom: bam_dma: get num-channels and num-ees from dt
net: stmmac: ensure that the device has released ownership before reading data
net: stmmac: ensure that the MSS desc is the last desc to set the own bit
cpufreq: Reorder cpufreq_online() error code path
PCI: Add function 1 DMA alias quirk for Marvell 88SE9220
udf: Provide saner default for invalid uid / gid
ARM: dts: bcm283x: Fix probing of bcm2835-i2s
audit: return on memory error to avoid null pointer dereference
rcu: Call touch_nmi_watchdog() while printing stall warnings
pinctrl: sh-pfc: r8a7796: Fix MOD_SEL register pin assignment for SSI pins group
MIPS: Octeon: Fix logging messages with spurious periods after newlines
drm/rockchip: Respect page offset for PRIME mmap calls
x86/apic: Set up through-local-APIC mode on the boot CPU if 'noapic' specified
perf tests: Use arch__compare_symbol_names to compare symbols
perf report: Fix memory corruption in --branch-history mode --branch-history
selftests/net: fixes psock_fanout eBPF test case
netlabel: If PF_INET6, check sk_buff ip header version
regmap: Correct comparison in regmap_cached
ARM: dts: imx7d: cl-som-imx7: fix pinctrl_enet
ARM: dts: porter: Fix HDMI output routing
regulator: of: Add a missing 'of_node_put()' in an error handling path of 'of_regulator_match()'
pinctrl: msm: Use dynamic GPIO numbering
kdb: make "mdr" command repeat
Linux 4.9.104
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
[ Upstream commit ac7f1061c2 ]
Current code does:
if (sscanf(dentry->d_name.name, "%lx-%lx", start, end) != 2)
However sscanf() is broken garbage.
It silently accepts whitespace between format specifiers
(did you know that?).
It silently accepts valid strings which result in integer overflow.
Do not use sscanf() for any even remotely reliable parsing code.
OK
# readlink '/proc/1/map_files/55a23af39000-55a23b05b000'
/lib/systemd/systemd
broken
# readlink '/proc/1/map_files/ 55a23af39000-55a23b05b000'
/lib/systemd/systemd
broken
# readlink '/proc/1/map_files/55a23af39000-55a23b05b000 '
/lib/systemd/systemd
very broken
# readlink '/proc/1/map_files/1000000000000000055a23af39000-55a23b05b000'
/lib/systemd/systemd
Andrei said:
: This patch breaks criu. It was a bug in criu. And this bug is on a minor
: path, which works when memfd_create() isn't available. It is a reason why
: I ask to not backport this patch to stable kernels.
:
: In CRIU this bug can be triggered, only if this patch will be backported
: to a kernel which version is lower than v3.16.
Link: http://lkml.kernel.org/r/20171120212706.GA14325@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: Andrei Vagin <avagin@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Changes in 4.9.101
8139too: Use disable_irq_nosync() in rtl8139_poll_controller()
bridge: check iface upper dev when setting master via ioctl
dccp: fix tasklet usage
ipv4: fix memory leaks in udp_sendmsg, ping_v4_sendmsg
llc: better deal with too small mtu
net: ethernet: sun: niu set correct packet size in skb
net: ethernet: ti: cpsw: fix packet leaking in dual_mac mode
net/mlx4_en: Verify coalescing parameters are in range
net/mlx5: E-Switch, Include VF RDMA stats in vport statistics
net_sched: fq: take care of throttled flows before reuse
net: support compat 64-bit time in {s,g}etsockopt
openvswitch: Don't swap table in nlattr_set() after OVS_ATTR_NESTED is found
qmi_wwan: do not steal interfaces from class drivers
r8169: fix powering up RTL8168h
sctp: handle two v4 addrs comparison in sctp_inet6_cmp_addr
sctp: remove sctp_chunk_put from fail_mark err path in sctp_ulpevent_make_rcvmsg
sctp: use the old asoc when making the cookie-ack chunk in dupcook_d
tcp_bbr: fix to zero idle_restart only upon S/ACKed data
tg3: Fix vunmap() BUG_ON() triggered from tg3_free_consistent().
bonding: do not allow rlb updates to invalid mac
net/mlx5: Avoid cleaning flow steering table twice during error flow
bonding: send learning packets for vlans on slave
tcp: ignore Fast Open on repair mode
sctp: fix the issue that the cookie-ack with auth can't get processed
sctp: delay the authentication for the duplicated cookie-echo chunk
serial: sccnxp: Fix error handling in sccnxp_probe()
futex: Remove duplicated code and fix undefined behaviour
xfrm: fix xfrm_do_migrate() with AEAD e.g(AES-GCM)
lockd: lost rollback of set_grace_period() in lockd_down_net()
Revert "ARM: dts: imx6qdl-wandboard: Fix audio channel swap"
l2tp: revert "l2tp: fix missing print session offset info"
nfp: TX time stamp packets before HW doorbell is rung
proc: do not access cmdline nor environ from file-backed areas
futex: futex_wake_op, fix sign_extend32 sign bits
kernel/exit.c: avoid undefined behaviour when calling wait4()
Linux 4.9.101
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
commit 7f7ccc2ccc upstream.
proc_pid_cmdline_read() and environ_read() directly access the target
process' VM to retrieve the command line and environment. If this
process remaps these areas onto a file via mmap(), the requesting
process may experience various issues such as extra delays if the
underlying device is slow to respond.
Let's simply refuse to access file-backed areas in these functions.
For this we add a new FOLL_ANON gup flag that is passed to all calls
to access_remote_vm(). The code already takes care of such failures
(including unmapped areas). Accesses via /proc/pid/mem were not
changed though.
This was assigned CVE-2018-1120.
Note for stable backports: the patch may apply to kernels prior to 4.11
but silently miss one location; it must be checked that no call to
access_remote_vm() keeps zero as the last argument.
Reported-by: Qualys Security Advisory <qsa@qualys.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Willy Tarreau <w@1wt.eu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Add time in state data to task structs, and create
/proc/<pid>/time_in_state files to show how long each individual task
has run at each frequency.
Create a CONFIG_CPU_FREQ_TIMES option to enable/disable this tracking.
Signed-off-by: Connor O'Brien <connoro@google.com>
Bug: 72339335
Bug: 70951257
Test: Read /proc/<pid>/time_in_state
Change-Id: Ia6456754f4cb1e83b2bc35efa8fbe9f8696febc8
Changes in 4.9.33
PCI/PM: Add needs_resume flag to avoid suspend complete optimization
drm/i915: Prevent the system suspend complete optimization
partitions/msdos: FreeBSD UFS2 file systems are not recognized
netfilter: nf_conntrack_sip: fix wrong memory initialisation
ibmvnic: Fix endian errors in error reporting output
ibmvnic: Fix endian error when requesting device capabilities
net: xilinx_emaclite: fix freezes due to unordered I/O
net: xilinx_emaclite: fix receive buffer overflow
tcp: tcp_probe: use spin_lock_bh()
ipv6: Handle IPv4-mapped src to in6addr_any dst.
ipv6: Inhibit IPv4-mapped src address on the wire.
tipc: Fix tipc_sk_reinit race conditions
gfs2: Use rhashtable walk interface in glock_hash_walk
NET: Fix /proc/net/arp for AX.25
ibmvnic: Call napi_disable instead of napi_enable in failure path
ibmvnic: Initialize completion variables before starting work
NET: mkiss: Fix panic
net: hns: Fix the device being used for dma mapping during TX
sierra_net: Skip validating irrelevant fields for IDLE LSIs
sierra_net: Add support for IPv6 and Dual-Stack Link Sense Indications
i2c: piix4: Request the SMBUS semaphore inside the mutex
i2c: piix4: Fix request_region size
powerpc/powernv: Properly set "host-ipi" on IPIs
kernel/ucount.c: mark user_header with kmemleak_ignore()
net: thunderx: Fix PHY autoneg for SGMII QLM mode
ipv6: addrconf: fix generation of new temporary addresses
vfio/spapr_tce: Set window when adding additional groups to container
ipv6: Fix IPv6 packet loss in scenarios involving roaming + snooping switches
ARM: defconfigs: make NF_CT_PROTO_SCTP and NF_CT_PROTO_UDPLITE built-in
PM / runtime: Avoid false-positive warnings from might_sleep_if()
jump label: pass kbuild_cflags when checking for asm goto support
shmem: fix sleeping from atomic context
kasan: respect /proc/sys/kernel/traceoff_on_warning
log2: make order_base_2() behave correctly on const input value zero
ethtool: do not vzalloc(0) on registers dump
net: phy: Fix lack of reference count on PHY driver
net: phy: Fix PHY module checks and NULL deref in phy_attach_direct()
net: fix ndo_features_check/ndo_fix_features comment ordering
fscache: Fix dead object requeue
fscache: Clear outstanding writes when disabling a cookie
FS-Cache: Initialise stores_lock in netfs cookie
ipv6: fix flow labels when the traffic class is non-0
drm/nouveau: prevent userspace from deleting client object
drm/nouveau/fence/g84-: protect against concurrent access to semaphore buffers
net/mlx4_core: Avoid command timeouts during VF driver device shutdown
gianfar: synchronize DMA API usage by free_skb_rx_queue w/ gfar_new_page
pinctrl: baytrail: Rectify debounce support (part 2)
cec: fix wrong last_la determination
drm: prevent double-(un)registration for connectors
drm: Don't race connector registration
pinctrl: berlin-bg4ct: fix the value for "sd1a" of pin SCRD0_CRD_PRES
net: adaptec: starfire: add checks for dma mapping errors
drm/i915: Check for NULL i915_vma in intel_unpin_fb_obj()
net/mlx5: E-Switch, Err when retrieving steering name-space fails
net/mlx5: Return EOPNOTSUPP when failing to get steering name-space
parisc, parport_gsc: Fixes for printk continuation lines
net: phy: micrel: add support for KSZ8795
gtp: add genl family modules alias
drm/nouveau: Intercept ACPI_VIDEO_NOTIFY_PROBE
drm/nouveau: Rename acpi_work to hpd_work
drm/nouveau: Handle fbcon suspend/resume in seperate worker
drm/nouveau: Don't enabling polling twice on runtime resume
drm/nouveau: Fix drm poll_helper handling
drm/ast: Fixed system hanged if disable P2A
ravb: unmap descriptors when freeing rings
nfs: Fix "Don't increment lock sequence ID after NFS4ERR_MOVED"
nvmet-rdma: Fix missing dma sync to nvme data structures
r8152: avoid start_xmit to call napi_schedule during autosuspend
r8152: check rx after napi is enabled
r8152: re-schedule napi for tx
r8152: fix rtl8152_post_reset function
r8152: avoid start_xmit to schedule napi when napi is disabled
net-next: ethernet: mediatek: change the compatible string
bnxt_en: Fix bnxt_reset() in the slow path task.
bnxt_en: Enhance autoneg support.
bnxt_en: Fix RTNL lock usage on bnxt_update_link().
bnxt_en: Fix RTNL lock usage on bnxt_get_port_module_status().
sctp: sctp gso should set feature with NETIF_F_SG when calling skb_segment
sctp: sctp_addr_id2transport should verify the addr before looking up assoc
usb: musb: Fix external abort on non-linefetch for musb_irq_work()
mn10300: fix build error of missing fpu_save()
romfs: use different way to generate fsid for BLOCK or MTD
frv: add atomic64_add_unless()
frv: add missing atomic64 operations
proc: add a schedule point in proc_pid_readdir()
userfaultfd: fix SIGBUS resulting from false rwsem wakeups
kernel/watchdog.c: move hardlockup detector to separate file
kernel/watchdog.c: move shared definitions to nmi.h
kernel/watchdog: prevent false hardlockup on overloaded system
vhost/vsock: handle vhost_vq_init_access() error
ARC: smp-boot: Decouple Non masters waiting API from jump to entry point
ARCv2: smp-boot: wake_flag polling by non-Masters needs to be uncached
tipc: ignore requests when the connection state is not CONNECTED
tipc: fix connection refcount error
tipc: add subscription refcount to avoid invalid delete
tipc: fix nametbl_lock soft lockup at node/link events
netfilter: nf_tables: fix set->nelems counting with no NLM_F_EXCL
netfilter: nft_log: restrict the log prefix length to 127
RDMA/qedr: Dispatch port active event from qedr_add
RDMA/qedr: Fix and simplify memory leak in PD alloc
RDMA/qedr: Don't reset QP when queues aren't flushed
RDMA/qedr: Don't spam dmesg if QP is in error state
RDMA/qedr: Return max inline data in QP query result
xtensa: don't use linux IRQ #0
s390/kvm: do not rely on the ILC on kvm host protection fauls
drm/i915: Workaround VLV/CHV DSI scanline counter hardware fail
drm/i915: Always recompute watermarks when distrust_bios_wm is set, v2.
sparc64: make string buffers large enough
Linux 4.9.33
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Make oom_adj and oom_score_adj user read-only.
Bug: 19636629
Change-Id: I055bb172d5b4d3d856e25918f3c5de8edf31e4a3
Signed-off-by: Rom Lemarchand <romlem@google.com>
Reading auxv of any kernel thread results in NULL pointer dereferencing
in auxv_read() where mm can be NULL. Fix that by checking for NULL mm
and bailing out early. This is also the original behavior changed by
recent commit c531716785 ("proc: switch auxv to use of __mem_open()").
# cat /proc/2/auxv
Unable to handle kernel NULL pointer dereference at virtual address 000000a8
Internal error: Oops: 17 [#1] PREEMPT SMP ARM
CPU: 3 PID: 113 Comm: cat Not tainted 4.9.0-rc1-ARCH+ #1
Hardware name: BCM2709
task: ea3b0b00 task.stack: e99b2000
PC is at auxv_read+0x24/0x4c
LR is at do_readv_writev+0x2fc/0x37c
Process cat (pid: 113, stack limit = 0xe99b2210)
Call chain:
auxv_read
do_readv_writev
vfs_readv
default_file_splice_read
splice_direct_to_actor
do_splice_direct
do_sendfile
SyS_sendfile64
ret_fast_syscall
Fixes: c531716785 ("proc: switch auxv to use of __mem_open()")
Link: http://lkml.kernel.org/r/1476966200-14457-1-git-send-email-chianglungyu@gmail.com
Signed-off-by: Leon Yu <chianglungyu@gmail.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Kees Cook <keescook@chromium.org>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Mateusz Guzik <mguzik@redhat.com>
Cc: Janis Danisevskis <jdanis@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Now that Lorenzo cleaned things up and made the FOLL_FORCE users
explicit, it becomes obvious how some of them don't really need
FOLL_FORCE at all.
So remove FOLL_FORCE from the proc code that reads the command line and
arguments from user space.
The mem_rw() function actually does want FOLL_FORCE, because gdd (and
possibly many other debuggers) use it as a much more convenient version
of PTRACE_PEEKDATA, but we should consider making the FOLL_FORCE part
conditional on actually being a ptracer. This does not actually do
that, just moves adds a comment to that effect and moves the gup_flags
settings next to each other.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This removes the 'write' argument from access_remote_vm() and replaces
it with 'gup_flags' as use of this function previously silently implied
FOLL_FORCE, whereas after this patch callers explicitly pass this flag.
We make this explicit as use of FOLL_FORCE can result in surprising
behaviour (and hence bugs) within the mm subsystem.
Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull more vfs updates from Al Viro:
">rename2() work from Miklos + current_time() from Deepa"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
fs: Replace current_fs_time() with current_time()
fs: Replace CURRENT_TIME_SEC with current_time() for inode timestamps
fs: Replace CURRENT_TIME with current_time() for inode timestamps
fs: proc: Delete inode time initializations in proc_alloc_inode()
vfs: Add current_time() api
vfs: add note about i_op->rename changes to porting
fs: rename "rename2" i_op to "rename"
vfs: remove unused i_op->rename
fs: make remaining filesystems use .rename2
libfs: support RENAME_NOREPLACE in simple_rename()
fs: support RENAME_NOREPLACE for local filesystems
ncpfs: fix unused variable warning
Pull misc vfs updates from Al Viro:
"Assorted misc bits and pieces.
There are several single-topic branches left after this (rename2
series from Miklos, current_time series from Deepa Dinamani, xattr
series from Andreas, uaccess stuff from from me) and I'd prefer to
send those separately"
* 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (39 commits)
proc: switch auxv to use of __mem_open()
hpfs: support FIEMAP
cifs: get rid of unused arguments of CIFSSMBWrite()
posix_acl: uapi header split
posix_acl: xattr representation cleanups
fs/aio.c: eliminate redundant loads in put_aio_ring_file
fs/internal.h: add const to ns_dentry_operations declaration
compat: remove compat_printk()
fs/buffer.c: make __getblk_slow() static
proc: unsigned file descriptors
fs/file: more unsigned file descriptors
fs: compat: remove redundant check of nr_segs
cachefiles: Fix attempt to read i_blocks after deleting file [ver #2]
cifs: don't use memcpy() to copy struct iov_iter
get rid of separate multipage fault-in primitives
fs: Avoid premature clearing of capabilities
fs: Give dentry to inode_change_ok() instead of inode
fuse: Propagate dentry down to inode_change_ok()
ceph: Propagate dentry down to inode_change_ok()
xfs: Propagate dentry down to inode_change_ok()
...
When an interface to allow a task to change another tasks timerslack was
first proposed, it was suggested that something greater then
CAP_SYS_NICE would be needed, as a task could be delayed further then
what normally could be done with nice adjustments.
So CAP_SYS_PTRACE was adopted instead for what became the
/proc/<tid>/timerslack_ns interface. However, for Android (where this
feature originates), giving the system_server CAP_SYS_PTRACE would allow
it to observe and modify all tasks memory. This is considered too high
a privilege level for only needing to change the timerslack.
After some discussion, it was realized that a CAP_SYS_NICE process can
set a task as SCHED_FIFO, so they could fork some spinning processes and
set them all SCHED_FIFO 99, in effect delaying all other tasks for an
infinite amount of time.
So as a CAP_SYS_NICE task can already cause trouble for other tasks,
using it as a required capability for accessing and modifying
/proc/<tid>/timerslack_ns seems sufficient.
Thus, this patch loosens the capability requirements to CAP_SYS_NICE and
removes CAP_SYS_PTRACE, simplifying some of the code flow as well.
This is technically an ABI change, but as the feature just landed in
4.6, I suspect no one is yet using it.
Link: http://lkml.kernel.org/r/1469132667-17377-1-git-send-email-john.stultz@linaro.org
Signed-off-by: John Stultz <john.stultz@linaro.org>
Reviewed-by: Nick Kralevich <nnk@google.com>
Acked-by: Serge Hallyn <serge@hallyn.com>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: "Serge E. Hallyn" <serge@hallyn.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Oren Laadan <orenl@cellrox.com>
Cc: Ruchi Kandoi <kandoiruchi@google.com>
Cc: Rom Lemarchand <romlem@android.com>
Cc: Todd Kjos <tkjos@google.com>
Cc: Colin Cross <ccross@android.com>
Cc: Nick Kralevich <nnk@google.com>
Cc: Dmitry Shmidt <dimitrysh@google.com>
Cc: Elliott Hughes <enh@google.com>
Cc: Android Kernel Team <kernel-team@android.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
CURRENT_TIME macro is not appropriate for filesystems as it
doesn't use the right granularity for filesystem timestamps.
Use current_time() instead.
CURRENT_TIME is also not y2038 safe.
This is also in preparation for the patch that transitions
vfs timestamps to use 64 bit time and hence make them
y2038 safe. As part of the effort current_time() will be
extended to do range checks. Hence, it is necessary for all
file system timestamps to use current_time(). Also,
current_time() will be transitioned along with vfs to be
y2038 safe.
Note that whenever a single call to current_time() is used
to change timestamps in different inodes, it is because they
share the same time granularity.
Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Felipe Balbi <balbi@kernel.org>
Acked-by: Steven Whitehouse <swhiteho@redhat.com>
Acked-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Acked-by: David Sterba <dsterba@suse.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
inode_change_ok() will be resposible for clearing capabilities and IMA
extended attributes and as such will need dentry. Give it as an argument
to inode_change_ok() instead of an inode. Also rename inode_change_ok()
to setattr_prepare() to better relect that it does also some
modifications in addition to checks.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Pull audit fixes from Paul Moore:
"Two small patches to fix some bugs with the audit-by-executable
functionality we introduced back in v4.3 (both patches are marked
for the stable folks)"
* 'stable-4.8' of git://git.infradead.org/users/pcmoore/audit:
audit: fix exe_file access in audit_exe_compare
mm: introduce get_task_exe_file
For more convenient access if one has a pointer to the task.
As a minor nit take advantage of the fact that only task lock + rcu are
needed to safely grab ->exe_file. This saves mm refcount dance.
Use the helper in proc_exe_link.
Signed-off-by: Mateusz Guzik <mguzik@redhat.com>
Acked-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Acked-by: Richard Guy Briggs <rgb@redhat.com>
Cc: <stable@vger.kernel.org> # 4.3.x
Signed-off-by: Paul Moore <paul@paul-moore.com>
oom_score_adj is shared for the thread groups (via struct signal) but this
is not sufficient to cover processes sharing mm (CLONE_VM without
CLONE_SIGHAND) and so we can easily end up in a situation when some
processes update their oom_score_adj and confuse the oom killer. In the
worst case some of those processes might hide from the oom killer
altogether via OOM_SCORE_ADJ_MIN while others are eligible. OOM killer
would then pick up those eligible but won't be allowed to kill others
sharing the same mm so the mm wouldn't release the mm and so the memory.
It would be ideal to have the oom_score_adj per mm_struct because that is
the natural entity OOM killer considers. But this will not work because
some programs are doing
vfork()
set_oom_adj()
exec()
We can achieve the same though. oom_score_adj write handler can set the
oom_score_adj for all processes sharing the same mm if the task is not in
the middle of vfork. As a result all the processes will share the same
oom_score_adj. The current implementation is rather pessimistic and
checks all the existing processes by default if there is more than 1
holder of the mm but we do not have any reliable way to check for external
users yet.
Link: http://lkml.kernel.org/r/1466426628-15074-5-git-send-email-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Series "Handle oom bypass more gracefully", V5
The following 10 patches should put some order to very rare cases of mm
shared between processes and make the paths which bypass the oom killer
oom reapable and therefore much more reliable finally. Even though mm
shared outside of thread group is rare (either vforked tasks for a short
period, use_mm by kernel threads or exotic thread model of
clone(CLONE_VM) without CLONE_SIGHAND) it is better to cover them. Not
only it makes the current oom killer logic quite hard to follow and
reason about it can lead to weird corner cases. E.g. it is possible to
select an oom victim which shares the mm with unkillable process or
bypass the oom killer even when other processes sharing the mm are still
alive and other weird cases.
Patch 1 drops bogus task_lock and mm check from oom_{score_}adj_write.
This can be considered a bug fix with a low impact as nobody has noticed
for years.
Patch 2 drops sighand lock because it is not needed anymore as pointed
by Oleg.
Patch 3 is a clean up of oom_score_adj handling and a preparatory work
for later patches.
Patch 4 enforces oom_adj_score to be consistent between processes
sharing the mm to behave consistently with the regular thread groups.
This can be considered a user visible behavior change because one thread
group updating oom_score_adj will affect others which share the same mm
via clone(CLONE_VM). I argue that this should be acceptable because we
already have the same behavior for threads in the same thread group and
sharing the mm without signal struct is just a different model of
threading. This is probably the most controversial part of the series,
I would like to find some consensus here. There were some suggestions
to hook some counter/oom_score_adj into the mm_struct but I feel that
this is not necessary right now and we can rely on proc handler +
oom_kill_process to DTRT. I can be convinced otherwise but I strongly
think that whatever we do the userspace has to have a way to see the
current oom priority as consistently as possible.
Patch 5 makes sure that no vforked task is selected if it is sharing the
mm with oom unkillable task.
Patch 6 ensures that all user tasks sharing the mm are killed which in
turn makes sure that all oom victims are oom reapable.
Patch 7 guarantees that task_will_free_mem will always imply reapable
bypass of the oom killer.
Patch 8 is new in this version and it addresses an issue pointed out by
0-day OOM report where an oom victim was reaped several times.
Patch 9 puts an upper bound on how many times oom_reaper tries to reap a
task and hides it from the oom killer to move on when no progress can be
made. This will give an upper bound to how long an oom_reapable task
can block the oom killer from selecting another victim if the oom_reaper
is not able to reap the victim.
Patch 10 tries to plug the (hopefully) last hole when we can still lock
up when the oom victim is shared with oom unkillable tasks (kthreads and
global init). We just try to be best effort in that case and rather
fallback to kill something else than risk a lockup.
This patch (of 10):
Both oom_adj_write and oom_score_adj_write are using task_lock, check for
task->mm and fail if it is NULL. This is not needed because the
oom_score_adj is per signal struct so we do not need mm at all. The code
has been introduced by 3d5992d2ac ("oom: add per-mm oom disable count")
but we do not do per-mm oom disable since c9f01245b6 ("oom: remove
oom_disable_count").
The task->mm check is even not correct because the current thread might
have exited but the thread group might be still alive - e.g. thread group
leader would lead that echo $VAL > /proc/pid/oom_score_adj would always
fail with EINVAL while /proc/pid/task/$other_tid/oom_score_adj would
succeed. This is unexpected at best.
Remove the lock along with the check to fix the unexpected behavior and
also because there is not real need for the lock in the first place.
Link: http://lkml.kernel.org/r/1466426628-15074-2-git-send-email-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The PR_DUMPABLE flag causes the pid related paths of the proc file
system to be owned by ROOT.
The implementation of pthread_set/getname_np however needs access to
/proc/<pid>/task/<tid>/comm. If PR_DUMPABLE is false this
implementation is locked out.
This patch installs a special permission function for the file "comm"
that grants read and write access to all threads of the same group
regardless of the ownership of the inode. For all other threads the
function falls back to the generic inode permission check.
[akpm@linux-foundation.org: fix spello in comment]
Signed-off-by: Janis Danisevskis <jdanis@google.com>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Colin Ian King <colin.king@canonical.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Minfei Huang <mnfhuang@gmail.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Calvin Owens <calvinowens@fb.com>
Cc: Jann Horn <jann@thejh.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull parallel filesystem directory handling update from Al Viro.
This is the main parallel directory work by Al that makes the vfs layer
able to do lookup and readdir in parallel within a single directory.
That's a big change, since this used to be all protected by the
directory inode mutex.
The inode mutex is replaced by an rwsem, and serialization of lookups of
a single name is done by a "in-progress" dentry marker.
The series begins with xattr cleanups, and then ends with switching
filesystems over to actually doing the readdir in parallel (switching to
the "iterate_shared()" that only takes the read lock).
A more detailed explanation of the process from Al Viro:
"The xattr work starts with some acl fixes, then switches ->getxattr to
passing inode and dentry separately. This is the point where the
things start to get tricky - that got merged into the very beginning
of the -rc3-based #work.lookups, to allow untangling the
security_d_instantiate() mess. The xattr work itself proceeds to
switch a lot of filesystems to generic_...xattr(); no complications
there.
After that initial xattr work, the series then does the following:
- untangle security_d_instantiate()
- convert a bunch of open-coded lookup_one_len_unlocked() to calls of
that thing; one such place (in overlayfs) actually yields a trivial
conflict with overlayfs fixes later in the cycle - overlayfs ended
up switching to a variant of lookup_one_len_unlocked() sans the
permission checks. I would've dropped that commit (it gets
overridden on merge from #ovl-fixes in #for-next; proper resolution
is to use the variant in mainline fs/overlayfs/super.c), but I
didn't want to rebase the damn thing - it was fairly late in the
cycle...
- some filesystems had managed to depend on lookup/lookup exclusion
for *fs-internal* data structures in a way that would break if we
relaxed the VFS exclusion. Fixing hadn't been hard, fortunately.
- core of that series - parallel lookup machinery, replacing
->i_mutex with rwsem, making lookup_slow() take it only shared. At
that point lookups happen in parallel; lookups on the same name
wait for the in-progress one to be done with that dentry.
Surprisingly little code, at that - almost all of it is in
fs/dcache.c, with fs/namei.c changes limited to lookup_slow() -
making it use the new primitive and actually switching to locking
shared.
- parallel readdir stuff - first of all, we provide the exclusion on
per-struct file basis, same as we do for read() vs lseek() for
regular files. That takes care of most of the needed exclusion in
readdir/readdir; however, these guys are trickier than lookups, so
I went for switching them one-by-one. To do that, a new method
'->iterate_shared()' is added and filesystems are switched to it
as they are either confirmed to be OK with shared lock on directory
or fixed to be OK with that. I hope to kill the original method
come next cycle (almost all in-tree filesystems are switched
already), but it's still not quite finished.
- several filesystems get switched to parallel readdir. The
interesting part here is dealing with dcache preseeding by readdir;
that needs minor adjustment to be safe with directory locked only
shared.
Most of the filesystems doing that got switched to in those
commits. Important exception: NFS. Turns out that NFS folks, with
their, er, insistence on VFS getting the fuck out of the way of the
Smart Filesystem Code That Knows How And What To Lock(tm) have
grown the locking of their own. They had their own homegrown
rwsem, with lookup/readdir/atomic_open being *writers* (sillyunlink
is the reader there). Of course, with VFS getting the fuck out of
the way, as requested, the actual smarts of the smart filesystem
code etc. had become exposed...
- do_last/lookup_open/atomic_open cleanups. As the result, open()
without O_CREAT locks the directory only shared. Including the
->atomic_open() case. Backmerge from #for-linus in the middle of
that - atomic_open() fix got brought in.
- then comes NFS switch to saner (VFS-based ;-) locking, killing the
homegrown "lookup and readdir are writers" kinda-sorta rwsem. All
exclusion for sillyunlink/lookup is done by the parallel lookups
mechanism. Exclusion between sillyunlink and rmdir is a real rwsem
now - rmdir being the writer.
Result: NFS lookups/readdirs/O_CREAT-less opens happen in parallel
now.
- the rest of the series consists of switching a lot of filesystems
to parallel readdir; in a lot of cases ->llseek() gets simplified
as well. One backmerge in there (again, #for-linus - rockridge
fix)"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (74 commits)
ext4: switch to ->iterate_shared()
hfs: switch to ->iterate_shared()
hfsplus: switch to ->iterate_shared()
hostfs: switch to ->iterate_shared()
hpfs: switch to ->iterate_shared()
hpfs: handle allocation failures in hpfs_add_pos()
gfs2: switch to ->iterate_shared()
f2fs: switch to ->iterate_shared()
afs: switch to ->iterate_shared()
befs: switch to ->iterate_shared()
befs: constify stuff a bit
isofs: switch to ->iterate_shared()
get_acorn_filename(): deobfuscate a bit
btrfs: switch to ->iterate_shared()
logfs: no need to lock directory in lseek
switch ecryptfs to ->iterate_shared
9p: switch to ->iterate_shared()
fat: switch to ->iterate_shared()
romfs, squashfs: switch to ->iterate_shared()
more trivial ->iterate_shared conversions
...
Backmerge to resolve a conflict in ovl_lookup_real();
"ovl_lookup_real(): use lookup_one_len_unlocked()" instead,
but it was too late in the cycle to rebase.
This reverts the 4.6-rc1 commit 7e2bc81da3 ("proc/base: make prompt
shell start from new line after executing "cat /proc/$pid/wchan")
because it breaks /proc/$PID/whcan formatting in ps and top.
Revert also because the patch is inconsistent - it adds a newline at the
end of only the '0' wchan, and does not add a newline when
/proc/$PID/wchan contains a symbol name.
eg.
$ ps -eo pid,stat,wchan,comm
PID STAT WCHAN COMMAND
...
1189 S - dbus-launch
1190 Ssl 0
dbus-daemon
1198 Sl 0
lightdm
1299 Ss ep_pol systemd
1301 S - (sd-pam)
1304 Ss wait sh
Signed-off-by: Robin Humble <plaguedbypenguins@gmail.com>
Cc: Minfei Huang <mnfhuang@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If /proc/<PID>/environ gets read before the envp[] array is fully set up
in create_{aout,elf,elf_fdpic,flat}_tables(), we might end up trying to
read more bytes than are actually written, as env_start will already be
set but env_end will still be zero, making the range calculation
underflow, allowing to read beyond the end of what has been written.
Fix this as it is done for /proc/<PID>/cmdline by testing env_end for
zero. It is, apparently, intentionally set last in create_*_tables().
This bug was found by the PaX size_overflow plugin that detected the
arithmetic underflow of 'this_len = env_end - (env_start + src)' when
env_end is still zero.
The expected consequence is that userland trying to access
/proc/<PID>/environ of a not yet fully set up process may get
inconsistent data as we're in the middle of copying in the environment
variables.
Fixes: https://forums.grsecurity.net/viewtopic.php?f=3&t=4363
Fixes: https://bugzilla.kernel.org/show_bug.cgi?id=116461
Signed-off-by: Mathias Krause <minipli@googlemail.com>
Cc: Emese Revfy <re.emese@gmail.com>
Cc: Pax Team <pageexec@freemail.hu>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Mateusz Guzik <mguzik@redhat.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Jarod Wilson <jarod@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It is not elegant that prompt shell does not start from new line after
executing "cat /proc/$pid/wchan". Make prompt shell start from new
line.
Signed-off-by: Minfei Huang <mnfhuang@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch provides a proc/PID/timerslack_ns interface which exposes a
task's timerslack value in nanoseconds and allows it to be changed.
This allows power/performance management software to set timer slack for
other threads according to its policy for the thread (such as when the
thread is designated foreground vs. background activity)
If the value written is non-zero, slack is set to that value. Otherwise
sets it to the default for the thread.
This interface checks that the calling task has permissions to to use
PTRACE_MODE_ATTACH_FSCREDS on the target task, so that we can ensure
arbitrary apps do not change the timer slack for other apps.
Signed-off-by: John Stultz <john.stultz@linaro.org>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Oren Laadan <orenl@cellrox.com>
Cc: Ruchi Kandoi <kandoiruchi@google.com>
Cc: Rom Lemarchand <romlem@android.com>
Cc: Android Kernel Team <kernel-team@android.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
By checking the effective credentials instead of the real UID / permitted
capabilities, ensure that the calling process actually intended to use its
credentials.
To ensure that all ptrace checks use the correct caller credentials (e.g.
in case out-of-tree code or newly added code omits the PTRACE_MODE_*CREDS
flag), use two new flags and require one of them to be set.
The problem was that when a privileged task had temporarily dropped its
privileges, e.g. by calling setreuid(0, user_uid), with the intent to
perform following syscalls with the credentials of a user, it still passed
ptrace access checks that the user would not be able to pass.
While an attacker should not be able to convince the privileged task to
perform a ptrace() syscall, this is a problem because the ptrace access
check is reused for things in procfs.
In particular, the following somewhat interesting procfs entries only rely
on ptrace access checks:
/proc/$pid/stat - uses the check for determining whether pointers
should be visible, useful for bypassing ASLR
/proc/$pid/maps - also useful for bypassing ASLR
/proc/$pid/cwd - useful for gaining access to restricted
directories that contain files with lax permissions, e.g. in
this scenario:
lrwxrwxrwx root root /proc/13020/cwd -> /root/foobar
drwx------ root root /root
drwxr-xr-x root root /root/foobar
-rw-r--r-- root root /root/foobar/secret
Therefore, on a system where a root-owned mode 6755 binary changes its
effective credentials as described and then dumps a user-specified file,
this could be used by an attacker to reveal the memory layout of root's
processes or reveal the contents of files he is not allowed to access
(through /proc/$pid/cwd).
[akpm@linux-foundation.org: fix warning]
Signed-off-by: Jann Horn <jann@thejh.net>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Casey Schaufler <casey@schaufler-ca.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Morris <james.l.morris@oracle.com>
Cc: "Serge E. Hallyn" <serge.hallyn@ubuntu.com>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Willy Tarreau <w@1wt.eu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>