mirror of
https://github.com/hardkernel/linux.git
synced 2026-04-02 03:03:00 +09:00
eb5683dcbffe69df0dbbcd7eb80fe853728ed40d
312 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
9797dcb8c7 |
Merge 4.9.104 into android-4.9
Changes in 4.9.104
MIPS: c-r4k: Fix data corruption related to cache coherence
MIPS: ptrace: Expose FIR register through FP regset
MIPS: Fix ptrace(2) PTRACE_PEEKUSR and PTRACE_POKEUSR accesses to o32 FGRs
KVM: Fix spelling mistake: "cop_unsuable" -> "cop_unusable"
affs_lookup(): close a race with affs_remove_link()
aio: fix io_destroy(2) vs. lookup_ioctx() race
ALSA: timer: Fix pause event notification
do d_instantiate/unlock_new_inode combinations safely
mmc: sdhci-iproc: remove hard coded mmc cap 1.8v
mmc: sdhci-iproc: fix 32bit writes for TRANSFER_MODE register
libata: Blacklist some Sandisk SSDs for NCQ
libata: blacklist Micron 500IT SSD with MU01 firmware
xen-swiotlb: fix the check condition for xen_swiotlb_free_coherent
drm/vmwgfx: Fix 32-bit VMW_PORT_HB_[IN|OUT] macros
IB/hfi1: Use after free race condition in send context error path
Revert "ipc/shm: Fix shmat mmap nil-page protection"
ipc/shm: fix shmat() nil address after round-down when remapping
kasan: fix memory hotplug during boot
kernel/sys.c: fix potential Spectre v1 issue
kernel/signal.c: avoid undefined behaviour in kill_something_info
KVM/VMX: Expose SSBD properly to guests
KVM: s390: vsie: fix < 8k check for the itdba
KVM: x86: Update cpuid properly when CR4.OSXAVE or CR4.PKE is changed
kvm: x86: IA32_ARCH_CAPABILITIES is always supported
firewire-ohci: work around oversized DMA reads on JMicron controllers
x86/tsc: Allow TSC calibration without PIT
NFSv4: always set NFS_LOCK_LOST when a lock is lost.
ALSA: hda - Use IS_REACHABLE() for dependency on input
kvm: x86: fix KVM_XEN_HVM_CONFIG ioctl
netfilter: ipv6: nf_defrag: Pass on packets to stack per RFC2460
tracing/hrtimer: Fix tracing bugs by taking all clock bases and modes into account
PCI: Add function 1 DMA alias quirk for Marvell 9128
Input: psmouse - fix Synaptics detection when protocol is disabled
i40iw: Zero-out consumer key on allocate stag for FMR
tools lib traceevent: Simplify pointer print logic and fix %pF
perf callchain: Fix attr.sample_max_stack setting
tools lib traceevent: Fix get_field_str() for dynamic strings
perf record: Fix failed memory allocation for get_cpuid_str
iommu/vt-d: Use domain instead of cache fetching
dm thin: fix documentation relative to low water mark threshold
net: stmmac: dwmac-meson8b: fix setting the RGMII TX clock on Meson8b
net: stmmac: dwmac-meson8b: propagate rate changes to the parent clock
nfs: Do not convert nfs_idmap_cache_timeout to jiffies
watchdog: sp5100_tco: Fix watchdog disable bit
kconfig: Don't leak main menus during parsing
kconfig: Fix automatic menu creation mem leak
kconfig: Fix expr_free() E_NOT leak
mac80211_hwsim: fix possible memory leak in hwsim_new_radio_nl()
ipmi/powernv: Fix error return code in ipmi_powernv_probe()
Btrfs: set plug for fsync
btrfs: Fix out of bounds access in btrfs_search_slot
Btrfs: fix scrub to repair raid6 corruption
btrfs: fail mount when sb flag is not in BTRFS_SUPER_FLAG_SUPP
HID: roccat: prevent an out of bounds read in kovaplus_profile_activated()
fm10k: fix "failed to kill vid" message for VF
device property: Define type of PROPERTY_ENRTY_*() macros
jffs2: Fix use-after-free bug in jffs2_iget()'s error handling path
powerpc/numa: Use ibm,max-associativity-domains to discover possible nodes
powerpc/numa: Ensure nodes initialized for hotplug
RDMA/mlx5: Avoid memory leak in case of XRCD dealloc failure
ntb_transport: Fix bug with max_mw_size parameter
gianfar: prevent integer wrapping in the rx handler
tcp_nv: fix potential integer overflow in tcpnv_acked
kvm: Map PFN-type memory regions as writable (if possible)
ocfs2: return -EROFS to mount.ocfs2 if inode block is invalid
ocfs2/acl: use 'ip_xattr_sem' to protect getting extended attribute
ocfs2: return error when we attempt to access a dirty bh in jbd2
mm/mempolicy: fix the check of nodemask from user
mm/mempolicy: add nodes_empty check in SYSC_migrate_pages
asm-generic: provide generic_pmdp_establish()
sparc64: update pmdp_invalidate() to return old pmd value
mm: thp: use down_read_trylock() in khugepaged to avoid long block
mm: pin address_space before dereferencing it while isolating an LRU page
mm/fadvise: discard partial page if endbyte is also EOF
openvswitch: Remove padding from packet before L3+ conntrack processing
IB/ipoib: Fix for potential no-carrier state
drm/nouveau/pmu/fuc: don't use movw directly anymore
netfilter: ipv6: nf_defrag: Kill frag queue on RFC2460 failure
x86/power: Fix swsusp_arch_resume prototype
firmware: dmi_scan: Fix handling of empty DMI strings
ACPI: processor_perflib: Do not send _PPC change notification if not ready
ACPI / scan: Use acpi_bus_get_status() to initialize ACPI_TYPE_DEVICE devs
bpf: fix selftests/bpf test_kmod.sh failure when CONFIG_BPF_JIT_ALWAYS_ON=y
MIPS: generic: Fix machine compatible matching
MIPS: TXx9: use IS_BUILTIN() for CONFIG_LEDS_CLASS
xen-netfront: Fix race between device setup and open
xen/grant-table: Use put_page instead of free_page
RDS: IB: Fix null pointer issue
arm64: spinlock: Fix theoretical trylock() A-B-A with LSE atomics
proc: fix /proc/*/map_files lookup
cifs: silence compiler warnings showing up with gcc-8.0.0
bcache: properly set task state in bch_writeback_thread()
bcache: fix for allocator and register thread race
bcache: fix for data collapse after re-attaching an attached device
bcache: return attach error when no cache set exist
tools/libbpf: handle issues with bpf ELF objects containing .eh_frames
bpf: fix rlimit in reuseport net selftest
vfs/proc/kcore, x86/mm/kcore: Fix SMAP fault when dumping vsyscall user page
locking/qspinlock: Ensure node->count is updated before initialising node
irqchip/gic-v3: Ignore disabled ITS nodes
cpumask: Make for_each_cpu_wrap() available on UP as well
irqchip/gic-v3: Change pr_debug message to pr_devel
ARC: Fix malformed ARC_EMUL_UNALIGNED default
ptr_ring: prevent integer overflow when calculating size
libata: Fix compile warning with ATA_DEBUG enabled
selftests: pstore: Adding config fragment CONFIG_PSTORE_RAM=m
selftests: memfd: add config fragment for fuse
ARM: OMAP2+: timer: fix a kmemleak caused in omap_get_timer_dt
ARM: OMAP3: Fix prm wake interrupt for resume
ARM: OMAP1: clock: Fix debugfs_create_*() usage
ibmvnic: Free RX socket buffer in case of adapter error
iwlwifi: mvm: fix security bug in PN checking
iwlwifi: mvm: always init rs with 20mhz bandwidth rates
NFC: llcp: Limit size of SDP URI
rxrpc: Work around usercopy check
mac80211: round IEEE80211_TX_STATUS_HEADROOM up to multiple of 4
mac80211: fix a possible leak of station stats
mac80211: fix calling sleeping function in atomic context
mac80211: Do not disconnect on invalid operating class
md raid10: fix NULL deference in handle_write_completed()
drm/exynos: g2d: use monotonic timestamps
drm/exynos: fix comparison to bitshift when dealing with a mask
locking/xchg/alpha: Add unconditional memory barrier to cmpxchg()
md: raid5: avoid string overflow warning
kernel/relay.c: limit kmalloc size to KMALLOC_MAX_SIZE
powerpc/bpf/jit: Fix 32-bit JIT for seccomp_data access
s390/cio: fix ccw_device_start_timeout API
s390/cio: fix return code after missing interrupt
s390/cio: clear timer when terminating driver I/O
PKCS#7: fix direct verification of SignerInfo signature
ARM: OMAP: Fix dmtimer init for omap1
smsc75xx: fix smsc75xx_set_features()
regulatory: add NUL to request alpha2
integrity/security: fix digsig.c build error with header file
locking/xchg/alpha: Fix xchg() and cmpxchg() memory ordering bugs
x86/topology: Update the 'cpu cores' field in /proc/cpuinfo correctly across CPU hotplug operations
mac80211: drop frames with unexpected DS bits from fast-rx to slow path
arm64: fix unwind_frame() for filtered out fn for function graph tracing
macvlan: fix use-after-free in macvlan_common_newlink()
kvm: fix warning for CONFIG_HAVE_KVM_EVENTFD builds
fs: dcache: Avoid livelock between d_alloc_parallel and __d_add
fs: dcache: Use READ_ONCE when accessing i_dir_seq
md: fix a potential deadlock of raid5/raid10 reshape
md/raid1: fix NULL pointer dereference
batman-adv: fix packet checksum in receive path
batman-adv: invalidate checksum on fragment reassembly
netfilter: ebtables: convert BUG_ONs to WARN_ONs
batman-adv: Ignore invalid batadv_iv_gw during netlink send
batman-adv: Ignore invalid batadv_v_gw during netlink send
batman-adv: Fix netlink dumping of BLA claims
batman-adv: Fix netlink dumping of BLA backbones
nvme-pci: Fix nvme queue cleanup if IRQ setup fails
clocksource/drivers/fsl_ftm_timer: Fix error return checking
ceph: fix dentry leak when failing to init debugfs
ARM: orion5x: Revert commit
|
||
|
|
960828aaa0 |
kernel/sys.c: fix potential Spectre v1 issue
commit
|
||
|
|
9c3041c524 |
Merge 4.9.102 into android-4.9
Changes in 4.9.102
usbip: usbip_host: refine probe and disconnect debug msgs to be useful
usbip: usbip_host: delete device from busid_table after rebind
usbip: usbip_host: run rebind from exit when module is removed
usbip: usbip_host: fix NULL-ptr deref and use-after-free errors
usbip: usbip_host: fix bad unlock balance during stub_probe()
ALSA: usb: mixer: volume quirk for CM102-A+/102S+
ALSA: hda: Add Lenovo C50 All in one to the power_save blacklist
ALSA: control: fix a redundant-copy issue
spi: pxa2xx: Allow 64-bit DMA
spi: bcm-qspi: Avoid setting MSPI_CDRAM_PCS for spi-nor master
spi: bcm-qspi: Always read and set BSPI_MAST_N_BOOT_CTRL
KVM: arm/arm64: VGIC/ITS: protect kvm_read_guest() calls with SRCU lock
powerpc: Don't preempt_disable() in show_cpuinfo()
signals: avoid unnecessary taking of sighand->siglock
tracing/x86/xen: Remove zero data size trace events trace_xen_mmu_flush_tlb{_all}
netfilter: nf_tables: can't fail after linking rule into active rule list
i2c: designware: fix poll-after-enable regression
powerpc/powernv: Fix NVRAM sleep in invalid context when crashing
mm: don't allow deferred pages with NEED_PER_CPU_KM
s390/qdio: fix access to uninitialized qdio_q fields
s390/cpum_sf: ensure sample frequency of perf event attributes is non-zero
s390/qdio: don't release memory in qdio_setup_irq()
s390: remove indirect branch from do_softirq_own_stack
x86/pkeys: Override pkey when moving away from PROT_EXEC
x86/pkeys: Do not special case protection key 0
efi: Avoid potential crashes, fix the 'struct efi_pci_io_protocol_32' definition for mixed mode
ARM: 8771/1: kprobes: Prohibit kprobes on do_undefinstr
tick/broadcast: Use for_each_cpu() specially on UP kernels
ARM: 8769/1: kprobes: Fix to use get_kprobe_ctlblk after irq-disabed
ARM: 8770/1: kprobes: Prohibit probing on optimized_callback
ARM: 8772/1: kprobes: Prohibit kprobes on get_user functions
Btrfs: fix xattr loss after power failure
btrfs: fix crash when trying to resume balance without the resume flag
x86/amd: don't set X86_BUG_SYSRET_SS_ATTRS when running under Xen
btrfs: fix reading stale metadata blocks after degraded raid1 mounts
x86/nospec: Simplify alternative_msr_write()
x86/bugs: Concentrate bug detection into a separate function
x86/bugs: Concentrate bug reporting into a separate function
x86/bugs: Read SPEC_CTRL MSR during boot and re-use reserved bits
x86/bugs, KVM: Support the combination of guest and host IBRS
x86/bugs: Expose /sys/../spec_store_bypass
x86/cpufeatures: Add X86_FEATURE_RDS
x86/bugs: Provide boot parameters for the spec_store_bypass_disable mitigation
x86/bugs/intel: Set proper CPU features and setup RDS
x86/bugs: Whitelist allowed SPEC_CTRL MSR values
x86/bugs/AMD: Add support to disable RDS on Fam[15,16,17]h if requested
x86/KVM/VMX: Expose SPEC_CTRL Bit(2) to the guest
x86/speculation: Create spec-ctrl.h to avoid include hell
prctl: Add speculation control prctls
x86/process: Optimize TIF checks in __switch_to_xtra()
x86/process: Correct and optimize TIF_BLOCKSTEP switch
x86/process: Optimize TIF_NOTSC switch
x86/process: Allow runtime control of Speculative Store Bypass
x86/speculation: Add prctl for Speculative Store Bypass mitigation
nospec: Allow getting/setting on non-current task
proc: Provide details on speculation flaw mitigations
seccomp: Enable speculation flaw mitigations
x86/bugs: Make boot modes __ro_after_init
prctl: Add force disable speculation
seccomp: Use PR_SPEC_FORCE_DISABLE
seccomp: Add filter flag to opt-out of SSB mitigation
seccomp: Move speculation migitation control to arch code
x86/speculation: Make "seccomp" the default mode for Speculative Store Bypass
x86/bugs: Rename _RDS to _SSBD
proc: Use underscores for SSBD in 'status'
Documentation/spec_ctrl: Do some minor cleanups
x86/bugs: Fix __ssb_select_mitigation() return type
x86/bugs: Make cpu_show_common() static
x86/bugs: Fix the parameters alignment and missing void
x86/cpu: Make alternative_msr_write work for 32-bit code
KVM: SVM: Move spec control call after restore of GS
x86/speculation: Use synthetic bits for IBRS/IBPB/STIBP
x86/cpufeatures: Disentangle MSR_SPEC_CTRL enumeration from IBRS
x86/cpufeatures: Disentangle SSBD enumeration
x86/cpu/AMD: Fix erratum 1076 (CPB bit)
x86/cpufeatures: Add FEATURE_ZEN
x86/speculation: Handle HT correctly on AMD
x86/bugs, KVM: Extend speculation control for VIRT_SPEC_CTRL
x86/speculation: Add virtualized speculative store bypass disable support
x86/speculation: Rework speculative_store_bypass_update()
x86/bugs: Unify x86_spec_ctrl_{set_guest,restore_host}
x86/bugs: Expose x86_spec_ctrl_base directly
x86/bugs: Remove x86_spec_ctrl_set()
x86/bugs: Rework spec_ctrl base and mask logic
x86/speculation, KVM: Implement support for VIRT_SPEC_CTRL/LS_CFG
KVM: SVM: Implement VIRT_SPEC_CTRL support for SSBD
x86/bugs: Rename SSBD_NO to SSB_NO
Linux 4.9.102
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
|
||
|
|
4272f528da |
nospec: Allow getting/setting on non-current task
commit
|
||
|
|
4812ffbbfc |
prctl: Add speculation control prctls
commit
|
||
|
|
3e4578f42f |
ANDROID: mm: add a field to store names for private anonymous memory
Userspace processes often have multiple allocators that each do anonymous mmaps to get memory. When examining memory usage of individual processes or systems as a whole, it is useful to be able to break down the various heaps that were allocated by each layer and examine their size, RSS, and physical memory usage. This patch adds a user pointer to the shared union in vm_area_struct that points to a null terminated string inside the user process containing a name for the vma. vmas that point to the same address will be merged, but vmas that point to equivalent strings at different addresses will not be merged. Userspace can set the name for a region of memory by calling prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME, start, len, (unsigned long)name); Setting the name to NULL clears it. The names of named anonymous vmas are shown in /proc/pid/maps as [anon:<name>] and in /proc/pid/smaps in a new "Name" field that is only present for named vmas. If the userspace pointer is no longer valid all or part of the name will be replaced with "<fault>". The idea to store a userspace pointer to reduce the complexity within mm (at the expense of the complexity of reading /proc/pid/mem) came from Dave Hansen. This results in no runtime overhead in the mm subsystem other than comparing the anon_name pointers when considering vma merging. The pointer is stored in a union with fieds that are only used on file-backed mappings, so it does not increase memory usage. Includes fix from Jed Davis <jld@mozilla.com> for typo in prctl_set_vma_anon_name, which could attempt to set the name across two vmas at the same time due to a typo, which might corrupt the vma list. Fix it to use tmp instead of end to limit the name setting to a single vma at a time. Change-Id: I9aa7b6b5ef536cd780599ba4e2fba8ceebe8b59f Signed-off-by: Dmitry Shmidt <dimitrysh@google.com> |
||
|
|
17b0573d77 |
prctl: make PR_SET_THP_DISABLE wait for mmap_sem killable
PR_SET_THP_DISABLE requires mmap_sem for write. If the waiting task gets killed by the oom killer it would block oom_reaper from asynchronous address space reclaim and reduce the chances of timely OOM resolving. Wait for the lock in the killable mode and return with EINTR if the task got killed while waiting. Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Alex Thorlton <athorlton@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
da8b44d5a9 |
timer: convert timer_slack_ns from unsigned long to u64
This patchset introduces a /proc/<pid>/timerslack_ns interface which would allow controlling processes to be able to set the timerslack value on other processes in order to save power by avoiding wakeups (Something Android currently does via out-of-tree patches). The first patch tries to fix the internal timer_slack_ns usage which was defined as a long, which limits the slack range to ~4 seconds on 32bit systems. It converts it to a u64, which provides the same basically unlimited slack (500 years) on both 32bit and 64bit machines. The second patch introduces the /proc/<pid>/timerslack_ns interface which allows the full 64bit slack range for a task to be read or set on both 32bit and 64bit machines. With these two patches, on a 32bit machine, after setting the slack on bash to 10 seconds: $ time sleep 1 real 0m10.747s user 0m0.001s sys 0m0.005s The first patch is a little ugly, since I had to chase the slack delta arguments through a number of functions converting them to u64s. Let me know if it makes sense to break that up more or not. Other than that things are fairly straightforward. This patch (of 2): The timer_slack_ns value in the task struct is currently a unsigned long. This means that on 32bit applications, the maximum slack is just over 4 seconds. However, on 64bit machines, its much much larger (~500 years). This disparity could make application development a little (as well as the default_slack) to a u64. This means both 32bit and 64bit systems have the same effective internal slack range. Now the existing ABI via PR_GET_TIMERSLACK and PR_SET_TIMERSLACK specify the interface as a unsigned long, so we preserve that limitation on 32bit systems, where SET_TIMERSLACK can only set the slack to a unsigned long value, and GET_TIMERSLACK will return ULONG_MAX if the slack is actually larger then what can be stored by an unsigned long. This patch also modifies hrtimer functions which specified the slack delta as a unsigned long. Signed-off-by: John Stultz <john.stultz@linaro.org> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Oren Laadan <orenl@cellrox.com> Cc: Ruchi Kandoi <kandoiruchi@google.com> Cc: Rom Lemarchand <romlem@android.com> Cc: Kees Cook <keescook@chromium.org> Cc: Android Kernel Team <kernel-team@android.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
ddf1d398e5 |
prctl: take mmap sem for writing to protect against others
An unprivileged user can trigger an oops on a kernel with
CONFIG_CHECKPOINT_RESTORE.
proc_pid_cmdline_read takes mmap_sem for reading and obtains args + env
start/end values. These get sanity checked as follows:
BUG_ON(arg_start > arg_end);
BUG_ON(env_start > env_end);
These can be changed by prctl_set_mm. Turns out also takes the semaphore for
reading, effectively rendering it useless. This results in:
kernel BUG at fs/proc/base.c:240!
invalid opcode: 0000 [#1] SMP
Modules linked in: virtio_net
CPU: 0 PID: 925 Comm: a.out Not tainted 4.4.0-rc8-next-20160105dupa+ #71
Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
task: ffff880077a68000 ti: ffff8800784d0000 task.ti: ffff8800784d0000
RIP: proc_pid_cmdline_read+0x520/0x530
RSP: 0018:ffff8800784d3db8 EFLAGS: 00010206
RAX: ffff880077c5b6b0 RBX: ffff8800784d3f18 RCX: 0000000000000000
RDX: 0000000000000002 RSI: 00007f78e8857000 RDI: 0000000000000246
RBP: ffff8800784d3e40 R08: 0000000000000008 R09: 0000000000000001
R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000050
R13: 00007f78e8857800 R14: ffff88006fcef000 R15: ffff880077c5b600
FS: 00007f78e884a740(0000) GS:ffff88007b200000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007f78e8361770 CR3: 00000000790a5000 CR4: 00000000000006f0
Call Trace:
__vfs_read+0x37/0x100
vfs_read+0x82/0x130
SyS_read+0x58/0xd0
entry_SYSCALL_64_fastpath+0x12/0x76
Code: 4c 8b 7d a8 eb e9 48 8b 9d 78 ff ff ff 4c 8b 7d 90 48 8b 03 48 39 45 a8 0f 87 f0 fe ff ff e9 d1 fe ff ff 4c 8b 7d 90 eb c6 0f 0b <0f> 0b 0f 0b 66 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00
RIP proc_pid_cmdline_read+0x520/0x530
---[ end trace 97882617ae9c6818 ]---
Turns out there are instances where the code just reads aformentioned
values without locking whatsoever - namely environ_read and get_cmdline.
Interestingly these functions look quite resilient against bogus values,
but I don't believe this should be relied upon.
The first patch gets rid of the oops bug by grabbing mmap_sem for
writing.
The second patch is optional and puts locking around aformentioned
consumers for safety. Consumers of other fields don't seem to benefit
from similar treatment and are left untouched.
This patch (of 2):
The code was taking the semaphore for reading, which does not protect
against readers nor concurrent modifications.
The problem could cause a sanity checks to fail in procfs's cmdline
reader, resulting in an OOPS.
Note that some functions perform an unlocked read of various mm fields,
but they seem to be fine despite possible modificaton.
Signed-off-by: Mateusz Guzik <mguzik@redhat.com>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Jarod Wilson <jarod@redhat.com>
Cc: Jan Stancek <jstancek@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Anshuman Khandual <anshuman.linux@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
||
|
|
8639b46139 |
pidns: fix set/getpriority and ioprio_set/get in PRIO_USER mode
setpriority(PRIO_USER, 0, x) will change the priority of tasks outside of the current pid namespace. This is in contrast to both the other modes of setpriority and the example of kill(-1). Fix this. getpriority and ioprio have the same failure mode, fix them too. Eric said: : After some more thinking about it this patch sounds justifiable. : : My goal with namespaces is not to build perfect isolation mechanisms : as that can get into ill defined territory, but to build well defined : mechanisms. And to handle the corner cases so you can use only : a single namespace with well defined results. : : In this case you have found the two interfaces I am aware of that : identify processes by uid instead of by pid. Which quite frankly is : weird. Unfortunately the weird unexpected cases are hard to handle : in the usual way. : : I was hoping for a little more information. Changes like this one we : have to be careful of because someone might be depending on the current : behavior. I don't think they are and I do think this make sense as part : of the pid namespace. Signed-off-by: Ben Segall <bsegall@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Ambrose Feinstein <ambrose@google.com> Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
90f8572b0f |
vfs: Commit to never having exectuables on proc and sysfs.
Today proc and sysfs do not contain any executable files. Several applications today mount proc or sysfs without noexec and nosuid and then depend on there being no exectuables files on proc or sysfs. Having any executable files show on proc or sysfs would cause a user space visible regression, and most likely security problems. Therefore commit to never allowing executables on proc and sysfs by adding a new flag to mark them as filesystems without executables and enforce that flag. Test the flag where MNT_NOEXEC is tested today, so that the only user visible effect will be that exectuables will be treated as if the execute bit is cleared. The filesystems proc and sysfs do not currently incoporate any executable files so this does not result in any user visible effects. This makes it unnecessary to vet changes to proc and sysfs tightly for adding exectuable files or changes to chattr that would modify existing files, as no matter what the individual file say they will not be treated as exectuable files by the vfs. Not having to vet changes to closely is important as without this we are only one proc_create call (or another goof up in the implementation of notify_change) from having problematic executables on proc. Those mistakes are all too easy to make and would create a situation where there are security issues or the assumptions of some program having to be broken (and cause userspace regressions). Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> |
||
|
|
4a00e9df29 |
prctl: more prctl(PR_SET_MM_*) checks
Individual prctl(PR_SET_MM_*) calls do some checking to maintain a
consistent view of mm->arg_start et al fields, but not enough. In
particular PR_SET_MM_ARG_START/PR_SET_MM_ARG_END/ R_SET_MM_ENV_START/
PR_SET_MM_ENV_END only check that the address lies in an existing VMA,
but don't check that the start address is lower than the end address _at
all_.
Consolidate all consistency checks, so there will be no difference in
the future between PR_SET_MM_MAP and individual PR_SET_MM_* calls.
The program below makes both ARGV and ENVP areas be reversed. It makes
/proc/$PID/cmdline show garbage (it doesn't oops by luck).
#include <sys/mman.h>
#include <sys/prctl.h>
#include <unistd.h>
enum {PAGE_SIZE=4096};
int main(void)
{
void *p;
p = mmap(NULL, PAGE_SIZE, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
#define PR_SET_MM 35
#define PR_SET_MM_ARG_START 8
#define PR_SET_MM_ARG_END 9
#define PR_SET_MM_ENV_START 10
#define PR_SET_MM_ENV_END 11
prctl(PR_SET_MM, PR_SET_MM_ARG_START, (unsigned long)p + PAGE_SIZE - 1, 0, 0);
prctl(PR_SET_MM, PR_SET_MM_ARG_END, (unsigned long)p, 0, 0);
prctl(PR_SET_MM, PR_SET_MM_ENV_START, (unsigned long)p + PAGE_SIZE - 1, 0, 0);
prctl(PR_SET_MM, PR_SET_MM_ENV_END, (unsigned long)p, 0, 0);
pause();
return 0;
}
[akpm@linux-foundation.org: tidy code, tweak comment]
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Jarod Wilson <jarod@redhat.com>
Cc: Jan Stancek <jstancek@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
||
|
|
46a6e0cf1c |
x86/mpx: Clean up the code by not passing a task pointer around when unnecessary
The MPX code can only work on the current task. You can not, for instance, enable MPX management in another process or thread. You can also not handle a fault for another process or thread. Despite this, we pass a task_struct around prolifically. This patch removes all of the task struct passing for code paths where the code can not deal with another task (which turns out to be all of them). This has no functional changes. It's just a cleanup. Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Dave Hansen <dave@sr71.net> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: bp@alien8.de Link: http://lkml.kernel.org/r/20150607183702.6A81DA2C@viggo.jf.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
|
|
6e399cd144 |
prctl: avoid using mmap_sem for exe_file serialization
Oleg cleverly suggested using xchg() to set the new mm->exe_file instead of calling set_mm_exe_file() which requires some form of serialization -- mmap_sem in this case. For archs that do not have atomic rmw instructions we still fallback to a spinlock alternative, so this should always be safe. As such, we only need the mmap_sem for looking up the backing vm_file, which can be done sharing the lock. Naturally, this means we need to manually deal with both the new and old file reference counting, and we need not worry about the MMF_EXE_FILE_CHANGED bits, which can probably be deleted in the future anyway. Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Suggested-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Reviewed-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
2813893f8b |
kernel: conditionally support non-root users, groups and capabilities
There are a lot of embedded systems that run most or all of their functionality in init, running as root:root. For these systems, supporting multiple users is not necessary. This patch adds a new symbol, CONFIG_MULTIUSER, that makes support for non-root users, non-root groups, and capabilities optional. It is enabled under CONFIG_EXPERT menu. When this symbol is not defined, UID and GID are zero in any possible case and processes always have all capabilities. The following syscalls are compiled out: setuid, setregid, setgid, setreuid, setresuid, getresuid, setresgid, getresgid, setgroups, getgroups, setfsuid, setfsgid, capget, capset. Also, groups.c is compiled out completely. In kernel/capability.c, capable function was moved in order to avoid adding two ifdef blocks. This change saves about 25 KB on a defconfig build. The most minimal kernels have total text sizes in the high hundreds of kB rather than low MB. (The 25k goes down a bit with allnoconfig, but not that much. The kernel was booted in Qemu. All the common functionalities work. Adding users/groups is not possible, failing with -ENOSYS. Bloat-o-meter output: add/remove: 7/87 grow/shrink: 19/397 up/down: 1675/-26325 (-24650) [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Iulia Manda <iulia.manda21@gmail.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org> Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> Tested-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
39afb5ee46 |
kernel/sys.c: fix UNAME26 for 4.0
There's a uname workaround for broken userspace which can't handle kernel versions of 3.x. Update it for 4.x. Signed-off-by: Jon DeVree <nuxi@vault24.org> Cc: Andi Kleen <andi@firstfloor.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
a135c717d5 |
Merge branch 'upstream' of git://git.linux-mips.org/pub/scm/ralf/upstream-linus
Pull MIPS updates from Ralf Baechle:
"This is the main pull request for MIPS:
- a number of fixes that didn't make the 3.19 release.
- a number of cleanups.
- preliminary support for Cavium's Octeon 3 SOCs which feature up to
48 MIPS64 R3 cores with FPU and hardware virtualization.
- support for MIPS R6 processors.
Revision 6 of the MIPS architecture is a major revision of the MIPS
architecture which does away with many of original sins of the
architecture such as branch delay slots. This and other changes in
R6 require major changes throughout the entire MIPS core
architecture code and make up for the lion share of this pull
request.
- finally some preparatory work for eXtendend Physical Address
support, which allows support of up to 40 bit of physical address
space on 32 bit processors"
[ Ahh, MIPS can't leave the PAE brain damage alone. It's like
every CPU architect has to make that mistake, but pee in the snow
by changing the TLA. But whether it's called PAE, LPAE or XPA,
it's horrid crud - Linus ]
* 'upstream' of git://git.linux-mips.org/pub/scm/ralf/upstream-linus: (114 commits)
MIPS: sead3: Corrected get_c0_perfcount_int
MIPS: mm: Remove dead macro definitions
MIPS: OCTEON: irq: add CIB and other fixes
MIPS: OCTEON: Don't do acknowledge operations for level triggered irqs.
MIPS: OCTEON: More OCTEONIII support
MIPS: OCTEON: Remove setting of processor specific CVMCTL icache bits.
MIPS: OCTEON: Core-15169 Workaround and general CVMSEG cleanup.
MIPS: OCTEON: Update octeon-model.h code for new SoCs.
MIPS: OCTEON: Implement DCache errata workaround for all CN6XXX
MIPS: OCTEON: Add little-endian support to asm/octeon/octeon.h
MIPS: OCTEON: Implement the core-16057 workaround
MIPS: OCTEON: Delete unused COP2 saving code
MIPS: OCTEON: Use correct instruction to read 64-bit COP0 register
MIPS: OCTEON: Save and restore CP2 SHA3 state
MIPS: OCTEON: Fix FP context save.
MIPS: OCTEON: Save/Restore wider multiply registers in OCTEON III CPUs
MIPS: boot: Provide more uImage options
MIPS: Remove unneeded #ifdef __KERNEL__ from asm/processor.h
MIPS: ip22-gio: Remove legacy suspend/resume support
mips: pci: Add ifdef around pci_proc_domain
...
|
||
|
|
9791554b45 |
MIPS,prctl: add PR_[GS]ET_FP_MODE prctl options for MIPS
Userland code may be built using an ABI which permits linking to objects that have more restrictive floating point requirements. For example, userland code may be built to target the O32 FPXX ABI. Such code may be linked with other FPXX code, or code built for either one of the more restrictive FP32 or FP64. When linking with more restrictive code, the overall requirement of the process becomes that of the more restrictive code. The kernel has no way to know in advance which mode the process will need to be executed in, and indeed it may need to change during execution. The dynamic loader is the only code which will know the overall required mode, and so it needs to have a means to instruct the kernel to switch the FP mode of the process. This patch introduces 2 new options to the prctl syscall which provide such a capability. The FP mode of the process is represented as a simple bitmask combining a number of mode bits mirroring those present in the hardware. Userland can either retrieve the current FP mode of the process: mode = prctl(PR_GET_FP_MODE); or modify the current FP mode of the process: err = prctl(PR_SET_FP_MODE, new_mode); Signed-off-by: Paul Burton <paul.burton@imgtec.com> Cc: Matthew Fortune <matthew.fortune@imgtec.com> Cc: Markos Chandras <markos.chandras@imgtec.com> Cc: linux-mips@linux-mips.org Patchwork: https://patchwork.linux-mips.org/patch/8899/ Signed-off-by: Ralf Baechle <ralf@linux-mips.org> |
||
|
|
e9d1b4f3c6 |
x86, mpx: Strictly enforce empty prctl() args
Description from Michael Kerrisk. He suggested an identical patch
to one I had already coded up and tested.
commit
|
||
|
|
fe3d197f84 |
x86, mpx: On-demand kernel allocation of bounds tables
This is really the meat of the MPX patch set. If there is one patch to review in the entire series, this is the one. There is a new ABI here and this kernel code also interacts with userspace memory in a relatively unusual manner. (small FAQ below). Long Description: This patch adds two prctl() commands to provide enable or disable the management of bounds tables in kernel, including on-demand kernel allocation (See the patch "on-demand kernel allocation of bounds tables") and cleanup (See the patch "cleanup unused bound tables"). Applications do not strictly need the kernel to manage bounds tables and we expect some applications to use MPX without taking advantage of this kernel support. This means the kernel can not simply infer whether an application needs bounds table management from the MPX registers. The prctl() is an explicit signal from userspace. PR_MPX_ENABLE_MANAGEMENT is meant to be a signal from userspace to require kernel's help in managing bounds tables. PR_MPX_DISABLE_MANAGEMENT is the opposite, meaning that userspace don't want kernel's help any more. With PR_MPX_DISABLE_MANAGEMENT, the kernel won't allocate and free bounds tables even if the CPU supports MPX. PR_MPX_ENABLE_MANAGEMENT will fetch the base address of the bounds directory out of a userspace register (bndcfgu) and then cache it into a new field (->bd_addr) in the 'mm_struct'. PR_MPX_DISABLE_MANAGEMENT will set "bd_addr" to an invalid address. Using this scheme, we can use "bd_addr" to determine whether the management of bounds tables in kernel is enabled. Also, the only way to access that bndcfgu register is via an xsaves, which can be expensive. Caching "bd_addr" like this also helps reduce the cost of those xsaves when doing table cleanup at munmap() time. Unfortunately, we can not apply this optimization to #BR fault time because we need an xsave to get the value of BNDSTATUS. ==== Why does the hardware even have these Bounds Tables? ==== MPX only has 4 hardware registers for storing bounds information. If MPX-enabled code needs more than these 4 registers, it needs to spill them somewhere. It has two special instructions for this which allow the bounds to be moved between the bounds registers and some new "bounds tables". They are similar conceptually to a page fault and will be raised by the MPX hardware during both bounds violations or when the tables are not present. This patch handles those #BR exceptions for not-present tables by carving the space out of the normal processes address space (essentially calling the new mmap() interface indroduced earlier in this patch set.) and then pointing the bounds-directory over to it. The tables *need* to be accessed and controlled by userspace because the instructions for moving bounds in and out of them are extremely frequent. They potentially happen every time a register pointing to memory is dereferenced. Any direct kernel involvement (like a syscall) to access the tables would obviously destroy performance. ==== Why not do this in userspace? ==== This patch is obviously doing this allocation in the kernel. However, MPX does not strictly *require* anything in the kernel. It can theoretically be done completely from userspace. Here are a few ways this *could* be done. I don't think any of them are practical in the real-world, but here they are. Q: Can virtual space simply be reserved for the bounds tables so that we never have to allocate them? A: As noted earlier, these tables are *HUGE*. An X-GB virtual area needs 4*X GB of virtual space, plus 2GB for the bounds directory. If we were to preallocate them for the 128TB of user virtual address space, we would need to reserve 512TB+2GB, which is larger than the entire virtual address space today. This means they can not be reserved ahead of time. Also, a single process's pre-popualated bounds directory consumes 2GB of virtual *AND* physical memory. IOW, it's completely infeasible to prepopulate bounds directories. Q: Can we preallocate bounds table space at the same time memory is allocated which might contain pointers that might eventually need bounds tables? A: This would work if we could hook the site of each and every memory allocation syscall. This can be done for small, constrained applications. But, it isn't practical at a larger scale since a given app has no way of controlling how all the parts of the app might allocate memory (think libraries). The kernel is really the only place to intercept these calls. Q: Could a bounds fault be handed to userspace and the tables allocated there in a signal handler instead of in the kernel? A: (thanks to tglx) mmap() is not on the list of safe async handler functions and even if mmap() would work it still requires locking or nasty tricks to keep track of the allocation state there. Having ruled out all of the userspace-only approaches for managing bounds tables that we could think of, we create them on demand in the kernel. Based-on-patch-by: Qiaowei Ren <qiaowei.ren@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: linux-mm@kvack.org Cc: linux-mips@linux-mips.org Cc: Dave Hansen <dave@sr71.net> Link: http://lkml.kernel.org/r/20141114151829.AD4310DE@viggo.jf.intel.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de> |
||
|
|
faafcba3b5 |
Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
"The main changes in this cycle were:
- Optimized support for Intel "Cluster-on-Die" (CoD) topologies (Dave
Hansen)
- Various sched/idle refinements for better idle handling (Nicolas
Pitre, Daniel Lezcano, Chuansheng Liu, Vincent Guittot)
- sched/numa updates and optimizations (Rik van Riel)
- sysbench speedup (Vincent Guittot)
- capacity calculation cleanups/refactoring (Vincent Guittot)
- Various cleanups to thread group iteration (Oleg Nesterov)
- Double-rq-lock removal optimization and various refactorings
(Kirill Tkhai)
- various sched/deadline fixes
... and lots of other changes"
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (72 commits)
sched/dl: Use dl_bw_of() under rcu_read_lock_sched()
sched/fair: Delete resched_cpu() from idle_balance()
sched, time: Fix build error with 64 bit cputime_t on 32 bit systems
sched: Improve sysbench performance by fixing spurious active migration
sched/x86: Fix up typo in topology detection
x86, sched: Add new topology for multi-NUMA-node CPUs
sched/rt: Use resched_curr() in task_tick_rt()
sched: Use rq->rd in sched_setaffinity() under RCU read lock
sched: cleanup: Rename 'out_unlock' to 'out_free_new_mask'
sched: Use dl_bw_of() under RCU read lock
sched/fair: Remove duplicate code from can_migrate_task()
sched, mips, ia64: Remove __ARCH_WANT_UNLOCKED_CTXSW
sched: print_rq(): Don't use tasklist_lock
sched: normalize_rt_tasks(): Don't use _irqsave for tasklist_lock, use task_rq_lock()
sched: Fix the task-group check in tg_has_rt_tasks()
sched/fair: Leverage the idle state info when choosing the "idlest" cpu
sched: Let the scheduler see CPU idle states
sched/deadline: Fix inter- exclusive cpusets migrations
sched/deadline: Clear dl_entity params when setscheduling to different class
sched/numa: Kill the wrong/dead TASK_DEAD check in task_numa_fault()
...
|
||
|
|
0baae41ea8 |
kernel/sys.c: compat sysinfo syscall: fix undefined behavior
Fix undefined behavior and compiler warning by replacing right shift 32 with upper_32_bits macro Signed-off-by: Scotty Bauer <sbauer@eng.utah.edu> Cc: Clemens Ladisch <clemens@ladisch.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
ec94fc3d59 |
kernel/sys.c: whitespace fixes
Fix minor errors and warning messages in kernel/sys.c. These errors were reported by checkpatch while working with some modifications in sys.c file. Fixing this first will help me to improve my further patches. ERROR: trailing whitespace - 9 ERROR: do not use assignment in if condition - 4 ERROR: spaces required around that '?' (ctx:VxO) - 10 ERROR: switch and case should be at the same indent - 3 total 26 errors & 3 warnings fixed. Signed-off-by: vishnu.ps <vishnu.ps@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
96dad67ff2 |
mm: use VM_BUG_ON_MM where possible
Dump the contents of the relevant struct_mm when we hit the bug condition. Signed-off-by: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
f606b77f1a |
prctl: PR_SET_MM -- introduce PR_SET_MM_MAP operation
During development of c/r we've noticed that in case if we need to support
user namespaces we face a problem with capabilities in prctl(PR_SET_MM,
...) call, in particular once new user namespace is created
capable(CAP_SYS_RESOURCE) no longer passes.
A approach is to eliminate CAP_SYS_RESOURCE check but pass all new values
in one bundle, which would allow the kernel to make more intensive test
for sanity of values and same time allow us to support checkpoint/restore
of user namespaces.
Thus a new command PR_SET_MM_MAP introduced. It takes a pointer of
prctl_mm_map structure which carries all the members to be updated.
prctl(PR_SET_MM, PR_SET_MM_MAP, struct prctl_mm_map *, size)
struct prctl_mm_map {
__u64 start_code;
__u64 end_code;
__u64 start_data;
__u64 end_data;
__u64 start_brk;
__u64 brk;
__u64 start_stack;
__u64 arg_start;
__u64 arg_end;
__u64 env_start;
__u64 env_end;
__u64 *auxv;
__u32 auxv_size;
__u32 exe_fd;
};
All members except @exe_fd correspond ones of struct mm_struct. To figure
out which available values these members may take here are meanings of the
members.
- start_code, end_code: represent bounds of executable code area
- start_data, end_data: represent bounds of data area
- start_brk, brk: used to calculate bounds for brk() syscall
- start_stack: used when accounting space needed for command
line arguments, environment and shmat() syscall
- arg_start, arg_end, env_start, env_end: represent memory area
supplied for command line arguments and environment variables
- auxv, auxv_size: carries auxiliary vector, Elf format specifics
- exe_fd: file descriptor number for executable link (/proc/self/exe)
Thus we apply the following requirements to the values
1) Any member except @auxv, @auxv_size, @exe_fd is rather an address
in user space thus it must be laying inside [mmap_min_addr, mmap_max_addr)
interval.
2) While @[start|end]_code and @[start|end]_data may point to an nonexisting
VMAs (say a program maps own new .text and .data segments during execution)
the rest of members should belong to VMA which must exist.
3) Addresses must be ordered, ie @start_ member must not be greater or
equal to appropriate @end_ member.
4) As in regular Elf loading procedure we require that @start_brk and
@brk be greater than @end_data.
5) If RLIMIT_DATA rlimit is set to non-infinity new values should not
exceed existing limit. Same applies to RLIMIT_STACK.
6) Auxiliary vector size must not exceed existing one (which is
predefined as AT_VECTOR_SIZE and depends on architecture).
7) File descriptor passed in @exe_file should be pointing
to executable file (because we use existing prctl_set_mm_exe_file_locked
helper it ensures that the file we are going to use as exe link has all
required permission granted).
Now about where these members are involved inside kernel code:
- @start_code and @end_code are used in /proc/$pid/[stat|statm] output;
- @start_data and @end_data are used in /proc/$pid/[stat|statm] output,
also they are considered if there enough space for brk() syscall
result if RLIMIT_DATA is set;
- @start_brk shown in /proc/$pid/stat output and accounted in brk()
syscall if RLIMIT_DATA is set; also this member is tested to
find a symbolic name of mmap event for perf system (we choose
if event is generated for "heap" area); one more aplication is
selinux -- we test if a process has PROCESS__EXECHEAP permission
if trying to make heap area being executable with mprotect() syscall;
- @brk is a current value for brk() syscall which lays inside heap
area, it's shown in /proc/$pid/stat. When syscall brk() succesfully
provides new memory area to a user space upon brk() completion the
mm::brk is updated to carry new value;
Both @start_brk and @brk are actively used in /proc/$pid/maps
and /proc/$pid/smaps output to find a symbolic name "heap" for
VMA being scanned;
- @start_stack is printed out in /proc/$pid/stat and used to
find a symbolic name "stack" for task and threads in
/proc/$pid/maps and /proc/$pid/smaps output, and as the same
as with @start_brk -- perf system uses it for event naming.
Also kernel treat this member as a start address of where
to map vDSO pages and to check if there is enough space
for shmat() syscall;
- @arg_start, @arg_end, @env_start and @env_end are printed out
in /proc/$pid/stat. Another access to the data these members
represent is to read /proc/$pid/environ or /proc/$pid/cmdline.
Any attempt to read these areas kernel tests with access_process_vm
helper so a user must have enough rights for this action;
- @auxv and @auxv_size may be read from /proc/$pid/auxv. Strictly
speaking kernel doesn't care much about which exactly data is
sitting there because it is solely for userspace;
- @exe_fd is referred from /proc/$pid/exe and when generating
coredump. We uses prctl_set_mm_exe_file_locked helper to update
this member, so exe-file link modification remains one-shot
action.
Still note that updating exe-file link now doesn't require sys-resource
capability anymore, after all there is no much profit in preventing setup
own file link (there are a number of ways to execute own code -- ptrace,
ld-preload, so that the only reliable way to find which exactly code is
executed is to inspect running program memory). Still we require the
caller to be at least user-namespace root user.
I believe the old interface should be deprecated and ripped off in a
couple of kernel releases if no one against.
To test if new interface is implemented in the kernel one can pass
PR_SET_MM_MAP_SIZE opcode and the kernel returns the size of currently
supported struct prctl_mm_map.
[akpm@linux-foundation.org: fix 80-col wordwrap in macro definitions]
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Tejun Heo <tj@kernel.org>
Acked-by: Andrew Vagin <avagin@openvz.org>
Tested-by: Andrew Vagin <avagin@openvz.org>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Vasiliy Kulikov <segoon@openwall.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Julien Tinnes <jln@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
||
|
|
71fe97e185 |
prctl: PR_SET_MM -- factor out mmap_sem when updating mm::exe_file
Instead of taking mm->mmap_sem inside prctl_set_mm_exe_file() move it out and rename the helper to prctl_set_mm_exe_file_locked(). This will allow to reuse this function in a next patch. Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Kees Cook <keescook@chromium.org> Cc: Tejun Heo <tj@kernel.org> Cc: Andrew Vagin <avagin@openvz.org> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: H. Peter Anvin <hpa@zytor.com> Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Vasiliy Kulikov <segoon@openwall.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Julien Tinnes <jln@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
8764b338b3 |
mm: use may_adjust_brk helper
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Kees Cook <keescook@chromium.org> Cc: Tejun Heo <tj@kernel.org> Cc: Andrew Vagin <avagin@openvz.org> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: H. Peter Anvin <hpa@zytor.com> Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Vasiliy Kulikov <segoon@openwall.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Julien Tinnes <jln@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
e78c349679 |
time, signal: Protect resource use statistics with seqlock
Both times() and clock_gettime(CLOCK_PROCESS_CPUTIME_ID) have scalability issues on large systems, due to both functions being serialized with a lock. The lock protects against reporting a wrong value, due to a thread in the task group exiting, its statistics reporting up to the signal struct, and that exited task's statistics being counted twice (or not at all). Protecting that with a lock results in times() and clock_gettime() being completely serialized on large systems. This can be fixed by using a seqlock around the events that gather and propagate statistics. As an additional benefit, the protection code can be moved into thread_group_cputime(), slightly simplifying the calling functions. In the case of posix_cpu_clock_get_task() things can be simplified a lot, because the calling function already ensures that the task sticks around, and the rest is now taken care of in thread_group_cputime(). This way the statistics reporting code can run lockless. Signed-off-by: Rik van Riel <riel@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Alex Thorlton <athorlton@sgi.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Daeseok Youn <daeseok.youn@gmail.com> Cc: David Rientjes <rientjes@google.com> Cc: Dongsheng Yang <yangds.fnst@cn.fujitsu.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Guillaume Morin <guillaume@morinfr.org> Cc: Ionut Alexa <ionut.m.alexa@gmail.com> Cc: Kees Cook <keescook@chromium.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Li Zefan <lizefan@huawei.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Michal Schmidt <mschmidt@redhat.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Vladimir Davydov <vdavydov@parallels.com> Cc: umgwanakikbuti@gmail.com Cc: fweisbec@gmail.com Cc: srao@redhat.com Cc: lwoodman@redhat.com Cc: atheurer@redhat.com Link: http://lkml.kernel.org/r/20140816134010.26a9b572@annuminas.surriel.com Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
|
|
1d4457f999 |
sched: move no_new_privs into new atomic flags
Since seccomp transitions between threads requires updates to the no_new_privs flag to be atomic, the flag must be part of an atomic flag set. This moves the nnp flag into a separate task field, and introduces accessors. Signed-off-by: Kees Cook <keescook@chromium.org> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Reviewed-by: Andy Lutomirski <luto@amacapital.net> |
||
|
|
7aa2c016db |
sched: Consolidate open coded implementations of nice level frobbing into nice_to_rlimit() and rlimit_to_nice()
Signed-off-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/a568a1e3cc8e78648f41b5035fa5e381d36274da.1399532322.git.yangds.fnst@cn.fujitsu.com Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
|
|
a0715cc226 |
mm, thp: add VM_INIT_DEF_MASK and PRCTL_THP_DISABLE
Add VM_INIT_DEF_MASK, to allow us to set the default flags for VMs. It also adds a prctl control which allows us to set the THP disable bit in mm->def_flags so that VMs will pick up the setting as they are created. Signed-off-by: Alex Thorlton <athorlton@sgi.com> Suggested-by: Oleg Nesterov <oleg@redhat.com> Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Mel Gorman <mgorman@suse.de> Acked-by: Rik van Riel <riel@redhat.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: David Rientjes <rientjes@google.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
c4a4d2f431 |
sys: Replace hardcoding of -20 and 19 with MIN_NICE and MAX_NICE
Signed-off-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Robin Holt <holt@sgi.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Kees Cook <keescook@chromium.org> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Link: http://lkml.kernel.org/r/0261f094b836f1acbcdf52e7166487c0c77323c8.1392103744.git.yangds.fnst@cn.fujitsu.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
|
|
2e1f383582 |
kernel/sys.c: k_getrusage() can use while_each_thread()
Change k_getrusage() to use while_each_thread(), no changes in the compiled code. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Kees Cook <keescook@chromium.org> Reviewed-by: Sameer Nanda <snanda@chromium.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
98611e4e6a |
exec: kill task_struct->did_exec
We can kill either task->did_exec or PF_FORKNOEXEC, they are mutually exclusive. The patch kills ->did_exec because it has a single user. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Kees Cook <keescook@chromium.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
81e41ea25b |
kernel/sys.c: remove obsolete #include <linux/kexec.h>
Commit
|
||
|
|
c7b96acf14 |
userns: Kill nsown_capable it makes the wrong thing easy
nsown_capable is a special case of ns_capable essentially for just CAP_SETUID and CAP_SETGID. For the existing users it doesn't noticably simplify things and from the suggested patches I have seen it encourages people to do the wrong thing. So remove nsown_capable. Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> |
||
|
|
15d94b8256 |
reboot: move shutdown/reboot related functions to kernel/reboot.c
This patch is preparatory. It moves reboot related syscall, etc functions from kernel/sys.c to kernel/reboot.c. Signed-off-by: Robin Holt <holt@sgi.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Russ Anderson <rja@sgi.com> Cc: Robin Holt <holt@sgi.com> Cc: Russell King <rmk+kernel@arm.linux.org.uk> Cc: Guan Xuetao <gxt@mprc.pku.edu.cn> Cc: Ingo Molnar <mingo@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
0efbee7089 |
reboot: remove -stable friendly PF_THREAD_BOUND define
Remove the prior patch's #define for easier backporting to the stable releases. Signed-off-by: Robin Holt <holt@sgi.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Russ Anderson <rja@sgi.com> Cc: Robin Holt <holt@sgi.com> Cc: Russell King <rmk+kernel@arm.linux.org.uk> Cc: Guan Xuetao <gxt@mprc.pku.edu.cn> Cc: Ingo Molnar <mingo@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
81dabb4641 |
exit.c: unexport __set_special_pids()
Move __set_special_pids() from exit.c to sys.c close to its single caller and make it static. And rename it to set_special_pids(), another helper with this name has gone away. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
45c64940c8 |
kernel/sys.c:do_sysinfo(): use get_monotonic_boottime()
Change do_sysinfo() to use get_monotonic_boottime() instead of do_posix_clock_monotonic_gettime() + monotonic_to_bootbased(). Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: John Stultz <johnstul@us.ibm.com> Cc: Tomas Janousek <tjanouse@redhat.com> Cc: Tomas Smetana <tsmetana@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
7ec75e1ca1 |
kernel/sys.c: sys_reboot(): fix malformed panic message
If LINUX_REBOOT_CMD_HALT for reboot failed, the message "cannot halt" will stay on the same line with the next message, so append a '\n'. Signed-off-by: liguang <lig.fnst@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
cf7df378aa |
reboot: rigrate shutdown/reboot to boot cpu
We recently noticed that reboot of a 1024 cpu machine takes approx 16
minutes of just stopping the cpus. The slowdown was tracked to commit
|
||
|
|
08d7676083 |
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal
Pull compat cleanup from Al Viro:
"Mostly about syscall wrappers this time; there will be another pile
with patches in the same general area from various people, but I'd
rather push those after both that and vfs.git pile are in."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal:
syscalls.h: slightly reduce the jungles of macros
get rid of union semop in sys_semctl(2) arguments
make do_mremap() static
sparc: no need to sign-extend in sync_file_range() wrapper
ppc compat wrappers for add_key(2) and request_key(2) are pointless
x86: trim sys_ia32.h
x86: sys32_kill and sys32_mprotect are pointless
get rid of compat_sys_semctl() and friends in case of ARCH_WANT_OLD_COMPAT_IPC
merge compat sys_ipc instances
consolidate compat lookup_dcookie()
convert vmsplice to COMPAT_SYSCALL_DEFINE
switch getrusage() to COMPAT_SYSCALL_DEFINE
switch epoll_pwait to COMPAT_SYSCALL_DEFINE
convert sendfile{,64} to COMPAT_SYSCALL_DEFINE
switch signalfd{,4}() to COMPAT_SYSCALL_DEFINE
make SYSCALL_DEFINE<n>-generated wrappers do asmlinkage_protect
make HAVE_SYSCALL_WRAPPERS unconditional
consolidate cond_syscall and SYSCALL_ALIAS declarations
teach SYSCALL_DEFINE<n> how to deal with long long/unsigned long long
get rid of duplicate logics in __SC_....[1-6] definitions
|
||
|
|
52b3694157 |
kernel/sys.c: make prctl(PR_SET_MM) generally available
The purpose of this patch is to allow privileged processes to set
their own per-memory memory-region fields:
start_code, end_code, start_data, end_data, start_brk, brk,
start_stack, arg_start, arg_end, env_start, env_end.
This functionality is needed by any application or package that needs to
reconstruct Linux processes, that is, to start them in any way other than
by means of an "execve()" from an executable file. This includes:
1. Restoring processes from a checkpoint-file (by all potential
user-level checkpointing packages, not only CRIU's).
2. Restarting processes on another node after process migration.
3. Starting duplicated copies of a running process (for reliability
and high-availablity).
4. Starting a process from an executable format that is not supported
by Linux, thus requiring a "manual execve" by a user-level utility.
5. Similarly, starting a process from a networked and/or crypted
executable that, for confidentiality, licensing or other reasons,
may not be written to the local file-systems.
The code that does that was already included in the Linux kernel by the
CRIU group, in the form of "prctl(PR_SET_MM)", but prior to this was
enclosed within their private "#ifdef CONFIG_CHECKPOINT_RESTORE", which is
normally disabled. The patch removes those ifdefs.
Signed-off-by: Amnon Shiloh <u3557@miso.sublimeip.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
||
|
|
4a22f16636 |
kernel/timer.c: move some non timer related syscalls to kernel/sys.c
Andrew Morton noted: akpm3:/usr/src/25> grep SYSCALL kernel/timer.c SYSCALL_DEFINE1(alarm, unsigned int, seconds) SYSCALL_DEFINE0(getpid) SYSCALL_DEFINE0(getppid) SYSCALL_DEFINE0(getuid) SYSCALL_DEFINE0(geteuid) SYSCALL_DEFINE0(getgid) SYSCALL_DEFINE0(getegid) SYSCALL_DEFINE0(gettid) SYSCALL_DEFINE1(sysinfo, struct sysinfo __user *, info) COMPAT_SYSCALL_DEFINE1(sysinfo, struct compat_sysinfo __user *, info) Only one of those should be in kernel/timer.c. Who wrote this thing? [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Acked-by: Thomas Gleixner <tglx@linutronix.de> Cc: Guenter Roeck <linux@roeck-us.net> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
6f389a8f1d |
PM / reboot: call syscore_shutdown() after disable_nonboot_cpus()
As commit |
||
|
|
2ca067efd8 |
poweroff: change orderly_poweroff() to use schedule_work()
David said:
Commit
|
||
|
|
8d2d5c4a25 |
switch getrusage() to COMPAT_SYSCALL_DEFINE
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> |
||
|
|
7ff6764061 |
usermodehelper: cleanup/fix __orderly_poweroff() && argv_free()
__orderly_poweroff() does argv_free() if call_usermodehelper_fns()
returns -ENOMEM. As Lucas pointed out, this can be wrong if -ENOMEM was
not triggered by the failing call_usermodehelper_setup(), in this case
both __orderly_poweroff() and argv_cleanup() can do kfree().
Kill argv_cleanup() and change __orderly_poweroff() to call argv_free()
unconditionally like do_coredump() does. This info->cleanup() is not
needed (and wrong) since
|
||
|
|
d895cb1af1 |
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs pile (part one) from Al Viro: "Assorted stuff - cleaning namei.c up a bit, fixing ->d_name/->d_parent locking violations, etc. The most visible changes here are death of FS_REVAL_DOT (replaced with "has ->d_weak_revalidate()") and a new helper getting from struct file to inode. Some bits of preparation to xattr method interface changes. Misc patches by various people sent this cycle *and* ocfs2 fixes from several cycles ago that should've been upstream right then. PS: the next vfs pile will be xattr stuff." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (46 commits) saner proc_get_inode() calling conventions proc: avoid extra pde_put() in proc_fill_super() fs: change return values from -EACCES to -EPERM fs/exec.c: make bprm_mm_init() static ocfs2/dlm: use GFP_ATOMIC inside a spin_lock ocfs2: fix possible use-after-free with AIO ocfs2: Fix oops in ocfs2_fast_symlink_readpage() code path get_empty_filp()/alloc_file() leave both ->f_pos and ->f_version zero target: writev() on single-element vector is pointless export kernel_write(), convert open-coded instances fs: encode_fh: return FILEID_INVALID if invalid fid_type kill f_vfsmnt vfs: kill FS_REVAL_DOT by adding a d_weak_revalidate dentry op nfsd: handle vfs_getattr errors in acl protocol switch vfs_getattr() to struct path default SET_PERSONALITY() in linux/elf.h ceph: prepopulate inodes only when request is aborted d_hash_and_lookup(): export, switch open-coded instances 9p: switch v9fs_set_create_acl() to inode+fid, do it before d_instantiate() 9p: split dropping the acls from v9fs_set_create_acl() ... |