linux

mirror of https://github.com/hardkernel/linux.git synced 2026-06-07 19:30:30 +09:00

Author	SHA1	Message	Date
Xunlei Pang	4b97cecd54	kexec: introduce a protection mechanism for the crashkernel reserved memory For the cases that some kernel (module) path stamps the crash reserved memory(already mapped by the kernel) where has been loaded the second kernel data, the kdump kernel will probably fail to boot when panic happens (or even not happens) leaving the culprit at large, this is unacceptable. The patch introduces a mechanism for detecting such cases: 1) After each crash kexec loading, it simply marks the reserved memory regions readonly since we no longer access it after that. When someone stamps the region, the first kernel will panic and trigger the kdump. The weak arch_kexec_protect_crashkres() is introduced to do the actual protection. 2) To allow multiple loading, once 1) was done we also need to remark the reserved memory to readwrite each time a system call related to kdump is made. The weak arch_kexec_unprotect_crashkres() is introduced to do the actual protection. The architecture can make its specific implementation by overriding arch_kexec_protect_crashkres() and arch_kexec_unprotect_crashkres(). Signed-off-by: Xunlei Pang <xlpang@redhat.com> Cc: Eric Biederman <ebiederm@xmission.com> Cc: Dave Young <dyoung@redhat.com> Cc: Minfei Huang <mhuang@redhat.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Baoquan He <bhe@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit `9b492cf580`) Signed-off-by: Alex Shi <alex.shi@linaro.org>	2017-10-25 11:23:52 +08:00
Catalin Marinas	8db423d761	kvm: arm64: Disable compiler instrumentation for hypervisor code With the recent rewrite of the arm64 KVM hypervisor code in C, enabling certain options like KASAN would allow the compiler to generate memory accesses or function calls to addresses not mapped at EL2. This patch disables the compiler instrumentation on the arm64 hypervisor code for gcov-based profiling (GCOV_KERNEL), undefined behaviour sanity checker (UBSAN) and kernel address sanitizer (KASAN). Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Cc: Christoffer Dall <christoffer.dall@linaro.org> Cc: Marc Zyngier <marc.zyngier@arm.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: <stable@vger.kernel.org> # 4.5+ Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org> (cherry picked from commit `a6cdf1c08c`) Signed-off-by: Alex Shi <alex.shi@linaro.org> Conflicts: arch/arm64/kvm/hyp/Makefile	2017-08-14 11:17:55 +08:00
Sameer Goel	89912966d1	efi/libstub/arm: Set default address and size cells values for an empty dtb In cases where a device tree is not provided (ie ACPI based system), an empty fdt is generated by efistub. #address-cells and #size-cells are not set in the empty fdt, so they default to 1 (4 byte wide). This can be an issue on 64-bit systems where values representing addresses, etc may be 8 bytes wide as the default value does not align with the general requirements for an empty DTB, and is fragile when passed to other agents as extra care is required to read the entire width of a value. This issue is observed on Qualcomm Technologies QDF24XX platforms when kexec-tools inserts 64-bit addresses into the "linux,elfcorehdr" and "linux,usable-memory-range" properties of the fdt. When the values are later consumed, they are truncated to 32-bit. Setting #address-cells and #size-cells to 2 at creation of the empty fdt resolves the observed issue, and makes the fdt less fragile. Signed-off-by: Sameer Goel <sgoel@codeaurora.org> Signed-off-by: Jeffrey Hugo <jhugo@codeaurora.org> Signed-off-by: AKASHI Takahiro <takahiro.akashi@linaro.org> Reviewed-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Conflicts: drivers/firmware/efi/libstub/fdt.c due to missing commit `abfb7b686a` ("efi/libstub/arm: Pass latest memory map to the kernel")	2017-06-19 15:09:08 +09:00
James Morse	9987607e80	Documentation: dt: chosen properties for arm64 kdump Add documentation for DT properties: linux,usable-memory-range linux,elfcorehdr used by arm64 kdump. Those are, respectively, a usable memory range allocated to crash dump kernel and the elfcorehdr's location within it. Signed-off-by: James Morse <james.morse@arm.com> [takahiro.akashi@linaro.org: update the text due to recent changes ] Signed-off-by: AKASHI Takahiro <takahiro.akashi@linaro.org> Acked-by: Mark Rutland <mark.rutland@arm.com> Cc: devicetree@vger.kernel.org Cc: Rob Herring <robh+dt@kernel.org> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>	2017-06-19 15:09:07 +09:00
AKASHI Takahiro	b07152e0d9	Documentation: kdump: describe arm64 port Add arch specific descriptions about kdump usage on arm64 to kdump.txt. Signed-off-by: AKASHI Takahiro <takahiro.akashi@linaro.org> Reviewed-by: Baoquan He <bhe@redhat.com> Acked-by: Dave Young <dyoung@redhat.com> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>	2017-06-19 15:09:07 +09:00
AKASHI Takahiro	e996060517	arm64: kdump: enable kdump in defconfig Kdump is enabled by default as kexec is. Signed-off-by: AKASHI Takahiro <takahiro.akashi@linaro.org> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Conflicts: arch/arm64/configs/defconfig	2017-06-19 15:09:06 +09:00
AKASHI Takahiro	59e464c3ce	arm64: kdump: provide /proc/vmcore file Arch-specific functions are added to allow for implementing a crash dump file interface, /proc/vmcore, which can be viewed as a ELF file. A user space tool, like kexec-tools, is responsible for allocating a separate region for the core's ELF header within crash kdump kernel memory and filling it in when executing kexec_load(). Then, its location will be advertised to crash dump kernel via a new device-tree property, "linux,elfcorehdr", and crash dump kernel preserves the region for later use with reserve_elfcorehdr() at boot time. On crash dump kernel, /proc/vmcore will access the primary kernel's memory with copy_oldmem_page(), which feeds the data page-by-page by ioremap'ing it since it does not reside in linear mapping on crash dump kernel. Meanwhile, elfcorehdr_read() is simple as the region is always mapped. Signed-off-by: AKASHI Takahiro <takahiro.akashi@linaro.org> Reviewed-by: James Morse <james.morse@arm.com> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Conflicts: arch/arm64/kernel/Makefile due to missing commit `214fad5507` ("arm64: relocation testing module")	2017-06-19 15:09:06 +09:00
AKASHI Takahiro	8c6faddfc7	arm64: kdump: add VMCOREINFO's for user-space tools In addition to common VMCOREINFO's defined in crash_save_vmcoreinfo_init(), we need to know, for crash utility, - kimage_voffset - PHYS_OFFSET to examine the contents of a dump file (/proc/vmcore) correctly due to the introduction of KASLR (CONFIG_RANDOMIZE_BASE) in v4.6. - VA_BITS is also required for makedumpfile command. arch_crash_save_vmcoreinfo() appends them to the dump file. More VMCOREINFO's may be added later. Signed-off-by: AKASHI Takahiro <takahiro.akashi@linaro.org> Reviewed-by: James Morse <james.morse@arm.com> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>	2017-06-19 15:09:05 +09:00
AKASHI Takahiro	ce45ba16b9	arm64: kdump: implement machine_crash_shutdown() Primary kernel calls machine_crash_shutdown() to shut down non-boot cpus and save registers' status in per-cpu ELF notes before starting crash dump kernel. See kernel_kexec(). Even if not all secondary cpus have shut down, we do kdump anyway. As we don't have to make non-boot(crashed) cpus offline (to preserve correct status of cpus at crash dump) before shutting down, this patch also adds a variant of smp_send_stop(). Signed-off-by: AKASHI Takahiro <takahiro.akashi@linaro.org> Reviewed-by: James Morse <james.morse@arm.com> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>	2017-06-19 15:09:05 +09:00
AKASHI Takahiro	76d3453b6b	arm64: hibernate: preserve kdump image around hibernation Since arch_kexec_protect_crashkres() removes a mapping for crash dump kernel image, the loaded data won't be preserved around hibernation. In this patch, helper functions, crash_prepare_suspend()/ crash_post_resume(), are additionally called before/after hibernation so that the relevant memory segments will be mapped again and preserved just as the others are. In addition, to minimize the size of hibernation image, crash_is_nosave() is added to pfn_is_nosave() in order to recognize only the pages that hold loaded crash dump kernel image as saveable. Hibernation excludes any pages that are marked as Reserved and yet "nosave." Signed-off-by: AKASHI Takahiro <takahiro.akashi@linaro.org> Reviewed-by: James Morse <james.morse@arm.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Conflicts: arch/arm64/kernel/hibernate.c due to missing commit `8ec058fd27` ("arm64: hibernate: Resume when hibernate image created on non-boot CPU")	2017-06-19 15:09:04 +09:00
Takahiro Akashi	32dcc8e7a2	arm64: kdump: protect crash dump kernel memory arch_kexec_protect_crashkres() and arch_kexec_unprotect_crashkres() are meant to be called by kexec_load() in order to protect the memory allocated for crash dump kernel once the image is loaded. The protection is implemented by unmapping the relevant segments in crash dump kernel memory, rather than making it read-only as other archs do, to prevent coherency issues due to potential cache aliasing (with mismatched attributes). Page-level mappings are consistently used here so that we can change the attributes of segments in page granularity as well as shrink the region also in page granularity through /sys/kernel/kexec_crash_size, putting the freed memory back to buddy system. Signed-off-by: AKASHI Takahiro <takahiro.akashi@linaro.org> Reviewed-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Conflicts: arch/arm64/mm/mmu.c The file has been heavily refactored in v4.12, in particular, by commit 951366d4b7 ("arm64/mmu: replace 'page_mappings_only' parameter with flags argument") commit `f14c66ce81` ("arm64: mm: replace 'block_mappings_allowed' with 'page_mappings_only'") commit `5ea5306c32` ("arm64: alternatives: apply boot time fixups via the linear mapping")	2017-06-19 15:09:04 +09:00
AKASHI Takahiro	8d8d32ab8f	arm64: mm: add set_memory_valid() This function validates and invalidates PTE entries, and will be utilized in kdump to protect loaded crash dump kernel image. Signed-off-by: AKASHI Takahiro <takahiro.akashi@linaro.org> Reviewed-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Conflicts: arch/arm64/mm/pageattr.c due to missing commit `83863f25e4` ("arm64: Add support for ARCH_SUPPORTS_DEBUG_PAGEALLOC")	2017-06-19 15:09:03 +09:00
AKASHI Takahiro	0d3db547e8	arm64: kdump: reserve memory for crash dump kernel "crashkernel=" kernel parameter specifies the size (and optionally the start address) of the system ram to be used by crash dump kernel. reserve_crashkernel() will allocate and reserve that memory at boot time of primary kernel. The memory range will be exposed to userspace as a resource named "Crash kernel" in /proc/iomem. Signed-off-by: AKASHI Takahiro <takahiro.akashi@linaro.org> Signed-off-by: Mark Salter <msalter@redhat.com> Signed-off-by: Pratyush Anand <panand@redhat.com> Reviewed-by: James Morse <james.morse@arm.com> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Reviewed-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Conflicts: arch/arm64/mm/init.c due to some included headers which are not to be here.	2017-06-19 15:09:02 +09:00
AKASHI Takahiro	a6a78623b7	arm64: limit memory regions based on DT property, usable-memory-range Crash dump kernel uses only a limited range of available memory as System RAM. On arm64 kdump, This memory range is advertised to crash dump kernel via a device-tree property under /chosen, linux,usable-memory-range = <BASE SIZE> Crash dump kernel reads this property at boot time and calls memblock_cap_memory_range() to limit usable memory which are listed either in UEFI memory map table or "memory" nodes of a device tree blob. Signed-off-by: AKASHI Takahiro <takahiro.akashi@linaro.org> Reviewed-by: Geoff Levand <geoff@infradead.org> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Acked-by: Mark Rutland <mark.rutland@arm.com> Reviewed-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>	2017-06-19 15:09:02 +09:00
AKASHI Takahiro	8512530e7a	memblock: add memblock_cap_memory_range() Add memblock_cap_memory_range() which will remove all the memblock regions except the memory range specified in the arguments. In addition, rework is done on memblock_mem_limit_remove_map() to re-implement it using memblock_cap_memory_range(). This function, like memblock_mem_limit_remove_map(), will not remove memblocks with MEMMAP_NOMAP attribute as they may be mapped and accessed later as "device memory." See the commit `a571d4eb55` ("mm/memblock.c: add new infrastructure to address the mem limit issue"). This function is used, in a succeeding patch in the series of arm64 kdump suuport, to limit the range of usable memory, or System RAM, on crash dump kernel. (Please note that "mem=" parameter is of little use for this purpose.) Signed-off-by: AKASHI Takahiro <takahiro.akashi@linaro.org> Reviewed-by: Will Deacon <will.deacon@arm.com> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Acked-by: Dennis Chen <dennis.chen@arm.com> Cc: linux-mm@kvack.org Cc: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>	2017-06-19 15:09:01 +09:00
AKASHI Takahiro	fc7b7e9917	memblock: add memblock_clear_nomap() This function, with a combination of memblock_mark_nomap(), will be used in a later kdump patch for arm64 when it temporarily isolates some range of memory from the other memory blocks in order to create a specific kernel mapping at boot time. Signed-off-by: AKASHI Takahiro <takahiro.akashi@linaro.org> Reviewed-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>	2017-06-19 15:09:01 +09:00
Geoff Levand	a6bc75e1aa	arm64/kexec: Add pr_debug output To aid in debugging kexec problems or when adding new functionality to kexec add a new routine kexec_image_info() and several inline pr_debug statements. Signed-off-by: Geoff Levand <geoff@infradead.org> Reviewed-by: James Morse <james.morse@arm.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>	2017-06-19 15:09:01 +09:00
Geoff Levand	5dbf54ceab	arm64/kexec: Enable kexec in the arm64 defconfig Signed-off-by: Geoff Levand <geoff@infradead.org> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Conflicts: arch/arm64/configs/defconfig due to missing commit `531d306731` ("arm64: defconfig: updates for 4.5")	2017-06-19 15:09:00 +09:00
Geoff Levand	32885538ff	arm64/kexec: Add core kexec support Add three new files, kexec.h, machine_kexec.c and relocate_kernel.S to the arm64 architecture that add support for the kexec re-boot mechanism (CONFIG_KEXEC) on arm64 platforms. Signed-off-by: Geoff Levand <geoff@infradead.org> Reviewed-by: James Morse <james.morse@arm.com> [catalin.marinas@arm.com: removed dead code following James Morse's comments] Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Conflicts: arch/arm64/Kconfig arch/arm64/kernel/Makefile Makefile was wrongly modified when merging: commit `762f201f0a` ("arm64: kernel: implement ACPI parking protocol")	2017-06-19 15:09:00 +09:00
Geoff Levand	46165e514b	arm64: Add back cpu reset routines Commit `68234df4ea` ("arm64: kill flush_cache_all()") removed the global arm64 routines cpu_reset() and cpu_soft_restart() needed by the arm64 kexec and kdump support. Add back a simplified version of cpu_soft_restart() with some changes needed for kexec in the new files cpu_reset.S, and cpu_reset.h. When a CPU is reset it needs to be put into the exception level it had when it entered the kernel. Update cpu_soft_restart() to accept an argument which signals if the reset address should be entered at EL1 or EL2, and add a new hypercall HVC_SOFT_RESTART which is used for the EL2 switch. Signed-off-by: Geoff Levand <geoff@infradead.org> Reviewed-by: James Morse <james.morse@arm.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>	2017-06-19 15:08:59 +09:00
Ard Biesheuvel	3bbd245ee8	arm64: mm: add param to force create_pgd_mapping() to use page mappings Add a bool parameter 'allow_block_mappings' to create_pgd_mapping() and the various helper functions that it descends into, to give the caller control over whether block entries may be used to create the mapping. The UEFI runtime mapping routines will use this to avoid creating block entries that would need to split up into page entries when applying the permissions listed in the Memory Attributes firmware table. This also replaces the block_mappings_allowed() helper function that was added for DEBUG_PAGEALLOC functionality, but the resulting code is functionally equivalent (given that debug_page_alloc does not operate on EFI page table entries anyway) Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Conflicts: arch/arm64/kernel/efi.c The file has been heavily refactored.	2017-06-19 15:08:59 +09:00
Dennis Chen	520bf99d24	mm/memblock.c: add new infrastructure to address the mem limit issue In some cases, memblock is queried by kernel to determine whether a specified address is RAM or not. For example, the ACPI core needs this information to determine which attributes to use when mapping ACPI regions(acpi_os_ioremap). Use of incorrect memory types can result in faults, data corruption, or other issues. Removing memory with memblock_enforce_memory_limit() throws away this information, and so a kernel booted with 'mem=' may suffer from the issues described above. To avoid this, we need to keep those NOMAP regions instead of removing all above the limit, which preserves the information we need while preventing other use of those regions. This patch adds new infrastructure to retain all NOMAP memblock regions while removing others, to cater for this. Link: http://lkml.kernel.org/r/1468475036-5852-2-git-send-email-dennis.chen@arm.com Signed-off-by: Dennis Chen <dennis.chen@arm.com> Acked-by: Steve Capper <steve.capper@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org> Cc: Pekka Enberg <penberg@kernel.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Tang Chen <tangchen@cn.fujitsu.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Rafael J. Wysocki <rafael@kernel.org> Cc: Will Deacon <will.deacon@arm.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Matt Fleming <matt@codeblueprint.co.uk> Cc: Kaly Xin <kaly.xin@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2017-06-14 13:43:49 +09:00
James Morse	a88cf7061c	arm64: smp: Add function to determine if cpus are stuck in the kernel kernel/smp.c has a fancy counter that keeps track of the number of CPUs it marked as not-present and left in cpu_park_loop(). If there are any CPUs spinning in here, features like kexec or hibernate may release them by overwriting this memory. This problem also occurs on machines using spin-tables to release secondary cores. After commit `44dbcc93ab` ("arm64: Fix behavior of maxcpus=N") we bring all known cpus into the secondary holding pen, meaning this memory can't be re-used by kexec or hibernate. Add a function cpus_are_stuck_in_kernel() to determine if either of these cases have occurred. Signed-off-by: James Morse <james.morse@arm.com> Acked-by: Mark Rutland <mark.rutland@arm.com> Reviewed-by: Suzuki K Poulose <suzuki.poulose@arm.com> Signed-off-by: Will Deacon <will.deacon@arm.com> Conflicts: arch/arm64/include/asm/smp.h due to missing commit `17eebd1a43` ("arm64: Add cpu_panic_kernel helper")	2017-06-14 13:43:18 +09:00
Suzuki K Poulose	4a9d7e312d	arm64: Handle early CPU boot failures A secondary CPU could fail to come online due to insufficient capabilities and could simply die or loop in the kernel. e.g, a CPU with no support for the selected kernel PAGE_SIZE loops in kernel with MMU turned off. or a hotplugged CPU which doesn't have one of the advertised system capability will die during the activation. There is no way to synchronise the status of the failing CPU back to the master. This patch solves the issue by adding a field to the secondary_data which can be updated by the failing CPU. If the secondary CPU fails even before turning the MMU on, it updates the status in a special variable reserved in the head.txt section to make sure that the update can be cache invalidated safely without possible sharing of cache write back granule. Here are the possible states : -1. CPU_MMU_OFF - Initial value set by the master CPU, this value indicates that the CPU could not turn the MMU on, hence the status could not be reliably updated in the secondary_data. Instead, the CPU has updated the status @ __early_cpu_boot_status. 0. CPU_BOOT_SUCCESS - CPU has booted successfully. 1. CPU_KILL_ME - CPU has invoked cpu_ops->die, indicating the master CPU to synchronise by issuing a cpu_ops->cpu_kill. 2. CPU_STUCK_IN_KERNEL - CPU couldn't invoke die(), instead is looping in the kernel. This information could be used by say, kexec to check if it is really safe to do a kexec reboot. 3. CPU_PANIC_KERNEL - CPU detected some serious issues which requires kernel to crash immediately. The secondary CPU cannot call panic() until it has initialised the GIC. This flag can be used to instruct the master to do so. Cc: Mark Rutland <mark.rutland@arm.com> Acked-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com> [catalin.marinas@arm.com: conflict resolution] [catalin.marinas@arm.com: converted "status" from int to long] [catalin.marinas@arm.com: updated update_early_cpu_boot_status to use str_l] Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>	2017-06-14 10:52:19 +09:00
Suzuki K Poulose	70733773ac	arm64: Move cpu_die_early to smp.c This patch moves cpu_die_early to smp.c, where it fits better. No functional changes, except for adding the necessary checks for CONFIG_HOTPLUG_CPU. Cc: Mark Rutland <mark.rutland@arm.com> Acked-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>	2017-06-14 10:52:06 +09:00
Suzuki K Poulose	8083cfe5b4	arm64: Introduce cpu_die_early Or in other words, make fail_incapable_cpu() reusable. We use fail_incapable_cpu() to kill a secondary CPU early during the bringup, which doesn't have the system advertised capabilities. This patch makes the routine more generic, to kill a secondary booting CPU, getting rid of the dependency on capability struct. This can be used by checks which are not necessarily attached to a capability struct (e.g, cpu ASIDBits). In that process, renames the function to cpu_die_early() to better match its functionality. This will be moved to arch/arm64/kernel/smp.c later. Cc: Mark Rutland <mark.rutland@arm.com> Acked-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>	2017-06-14 10:51:54 +09:00
Suzuki K Poulose	45a8334205	arm64: Add a helper for parking CPUs in a loop Adds a routine which can be used to park CPUs (spinning in kernel) when they can't be killed. Cc: Mark Rutland <mark.rutland@arm.com> Acked-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>	2017-06-14 10:51:43 +09:00
Joonsoo Kim	4cef0485a1	mm/slab: clean up DEBUG_PAGEALLOC processing code Currently, open code for checking DEBUG_PAGEALLOC cache is spread to some sites. It makes code unreadable and hard to change. This patch cleans up this code. The following patch will change the criteria for DEBUG_PAGEALLOC cache so this clean-up will help it, too. [akpm@linux-foundation.org: fix build with CONFIG_DEBUG_PAGEALLOC=n] Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2017-06-14 10:51:30 +09:00
Joonsoo Kim	f7cd74fb49	mm/slab: use more appropriate condition check for debug_pagealloc debug_pagealloc debugging is related to SLAB_POISON flag rather than FORCED_DEBUG option, although FORCED_DEBUG option will enable SLAB_POISON. Fix it. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2017-06-14 10:51:20 +09:00
Joonsoo Kim	9452a80ce4	mm/slab: activate debug_pagealloc in SLAB when it is actually enabled Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2017-06-14 10:51:06 +09:00
Yaowei Bai	3cd7d7814b	mm/memblock.c: memblock_is_memory()/reserved() can be boolean Make memblock_is_memory() and memblock_is_reserved return bool to improve readability due to these particular functions only using either one or zero as their return value. No functional change. Signed-off-by: Yaowei Bai <baiyaowei@cmss.chinamobile.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2017-06-14 10:50:54 +09:00
Ard Biesheuvel	02dd1897be	mm/memblock: add MEMBLOCK_NOMAP attribute to memblock memory table This introduces the MEMBLOCK_NOMAP attribute and the required plumbing to make it usable as an indicator that some parts of normal memory should not be covered by the kernel direct mapping. It is up to the arch to actually honor the attribute when laying out this mapping, but the memblock code itself is modified to disregard these regions for allocations and other general use. Cc: linux-mm@kvack.org Cc: Alexander Kuleshov <kuleshovmail@gmail.com> Cc: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Matt Fleming <matt@codeblueprint.co.uk> Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Signed-off-by: Will Deacon <will.deacon@arm.com>	2017-06-14 10:50:39 +09:00
Alex Shi	dfb9fdffce	Merge tag 'v4.4.71' into linux-linaro-lsk-v4.4 This is the 4.4.71 stable release	2017-06-08 12:11:49 +08:00
Greg Kroah-Hartman	4bbbc76964	Linux 4.4.71	2017-06-07 12:06:14 +02:00
Eric Sandeen	9d65be36a7	xfs: only return -errno or success from attr ->put_listent commit `2a6fba6d23` upstream. Today, the put_listent formatters return either 1 or 0; if they return 1, some callers treat this as an error and return it up the stack, despite "1" not being a valid (negative) error code. The intent seems to be that if the input buffer is full, we set seen_enough or set count = -1, and return 1; but some callers check the return before checking the seen_enough or count fields of the context. Fix this by only returning non-zero for actual errors encountered, and rely on the caller to first check the return value, then check the values in the context to decide what to do. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2017-06-07 12:06:03 +02:00
Darrick J. Wong	1b03d85a4f	xfs: in _attrlist_by_handle, copy the cursor back to userspace commit `0facef7fb0` upstream. When we're iterating inode xattrs by handle, we have to copy the cursor back to userspace so that a subsequent invocation actually retrieves subsequent contents. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com> Cc: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2017-06-07 12:06:03 +02:00
Eric Sandeen	c56605c69b	xfs: fix unaligned access in xfs_btree_visit_blocks commit `a4d768e702` upstream. This structure copy was throwing unaligned access warnings on sparc64: Kernel unaligned access at TPC[1043c088] xfs_btree_visit_blocks+0x88/0xe0 [xfs] xfs_btree_copy_ptrs does a memcpy, which avoids it. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2017-06-07 12:06:03 +02:00
Zorro Lang	9f7b5da057	xfs: bad assertion for delalloc an extent that start at i_size commit `892d2a5f70` upstream. By run fsstress long enough time enough in RHEL-7, I find an assertion failure (harder to reproduce on linux-4.11, but problem is still there): XFS: Assertion failed: (iflags & BMV_IF_DELALLOC) != 0, file: fs/xfs/xfs_bmap_util.c The assertion is in xfs_getbmap() funciton: if (map[i].br_startblock == DELAYSTARTBLOCK && --> map[i].br_startoff <= XFS_B_TO_FSB(mp, XFS_ISIZE(ip))) ASSERT((iflags & BMV_IF_DELALLOC) != 0); When map[i].br_startoff == XFS_B_TO_FSB(mp, XFS_ISIZE(ip)), the startoff is just at EOF. But we only need to make sure delalloc extents that are within EOF, not include EOF. Signed-off-by: Zorro Lang <zlang@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2017-06-07 12:06:02 +02:00
Brian Foster	3ba13d7f5b	xfs: fix indlen accounting error on partial delalloc conversion commit `0daaecacb8` upstream. The delalloc -> real block conversion path uses an incorrect calculation in the case where the middle part of a delalloc extent is being converted. This is documented as a rare situation because XFS generally attempts to maximize contiguity by converting as much of a delalloc extent as possible. If this situation does occur, the indlen reservation for the two new delalloc extents left behind by the conversion of the middle range is calculated and compared with the original reservation. If more blocks are required, the delta is allocated from the global block pool. This delta value can be characterized as the difference between the new total requirement (temp + temp2) and the currently available reservation minus those blocks that have already been allocated (startblockval(PREV.br_startblock) - allocated). The problem is that the current code does not account for previously allocated blocks correctly. It subtracts the current allocation count from the (new - old) delta rather than the old indlen reservation. This means that more indlen blocks than have been allocated end up stashed in the remaining extents and free space accounting is broken as a result. Fix up the calculation to subtract the allocated block count from the original extent indlen and thus correctly allocate the reservation delta based on the difference between the new total requirement and the unused blocks from the original reservation. Also remove a bogus assert that contradicts the fact that the new indlen reservation can be larger than the original indlen reservation. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2017-06-07 12:06:02 +02:00
Brian Foster	1d41dd5c1f	xfs: wait on new inodes during quotaoff dquot release commit `e20c8a517f` upstream. The quotaoff operation has a race with inode allocation that results in a livelock. An inode allocation that occurs before the quota status flags are updated acquires the appropriate dquots for the inode via xfs_qm_vop_dqalloc(). It then inserts the XFS_INEW inode into the perag radix tree, sometime later attaches the dquots to the inode and finally clears the XFS_INEW flag. Quotaoff expects to release the dquots from all inodes in the filesystem via xfs_qm_dqrele_all_inodes(). This invokes the AG inode iterator, which skips inodes in the XFS_INEW state because they are not fully constructed. If the scan occurs after dquots have been attached to an inode, but before XFS_INEW is cleared, the newly allocated inode will continue to hold a reference to the applicable dquots. When quotaoff invokes xfs_qm_dqpurge_all(), the reference count of those dquot(s) remain elevated and the dqpurge scan spins indefinitely. To address this problem, update the xfs_qm_dqrele_all_inodes() scan to wait on inodes marked on the XFS_INEW state. We wait on the inodes explicitly rather than skip and retry to avoid continuous retry loops due to a parallel inode allocation workload. Since quotaoff updates the quota state flags and uses a synchronous transaction before the dqrele scan, and dquots are attached to inodes after radix tree insertion iff quota is enabled, one INEW waiting pass through the AG guarantees that the scan has processed all inodes that could possibly hold dquot references. Reported-by: Eryu Guan <eguan@redhat.com> Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2017-06-07 12:06:02 +02:00
Brian Foster	9d97d6a152	xfs: update ag iterator to support wait on new inodes commit `ae2c4ac2dd` upstream. The AG inode iterator currently skips new inodes as such inodes are inserted into the inode radix tree before they are fully constructed. Certain contexts require the ability to wait on the construction of new inodes, however. The fs-wide dquot release from the quotaoff sequence is an example of this. Update the AG inode iterator to support the ability to wait on inodes flagged with XFS_INEW upon request. Create a new xfs_inode_ag_iterator_flags() interface and support a set of iteration flags to modify the iteration behavior. When the XFS_AGITER_INEW_WAIT flag is set, include XFS_INEW flags in the radix tree inode lookup and wait on them before the callback is executed. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2017-06-07 12:06:02 +02:00
Brian Foster	8e25af0dc5	xfs: support ability to wait on new inodes commit `756baca27f` upstream. Inodes that are inserted into the perag tree but still under construction are flagged with the XFS_INEW bit. Most contexts either skip such inodes when they are encountered or have the ability to handle them. The runtime quotaoff sequence introduces a context that must wait for construction of such inodes to correctly ensure that all dquots in the fs are released. In anticipation of this, support the ability to wait on new inodes. Wake the appropriate bit when XFS_INEW is cleared. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2017-06-07 12:06:02 +02:00
Brian Foster	cf55c35974	xfs: fix up quotacheck buffer list error handling commit `20e8a06378` upstream. The quotacheck error handling of the delwri buffer list assumes the resident buffers are locked and doesn't clear the _XBF_DELWRI_Q flag on the buffers that are dequeued. This can lead to assert failures on buffer release and possibly other locking problems. Move this code to a delwri queue cancel helper function to encapsulate the logic required to properly release buffers from a delwri queue. Update the helper to clear the delwri queue flag and call it from quotacheck. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2017-06-07 12:06:02 +02:00
Brian Foster	a76647a71c	xfs: prevent multi-fsb dir readahead from reading random blocks commit `cb52ee334a` upstream. Directory block readahead uses a complex iteration mechanism to map between high-level directory blocks and underlying physical extents. This mechanism attempts to traverse the higher-level dir blocks in a manner that handles multi-fsb directory blocks and simultaneously maintains a reference to the corresponding physical blocks. This logic doesn't handle certain (discontiguous) physical extent layouts correctly with multi-fsb directory blocks. For example, consider the case of a 4k FSB filesystem with a 2 FSB (8k) directory block size and a directory with the following extent layout: EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL 0: [0..7]: 88..95 0 (88..95) 8 1: [8..15]: 80..87 0 (80..87) 8 2: [16..39]: 168..191 0 (168..191) 24 3: [40..63]: 5242952..5242975 1 (72..95) 24 Directory block 0 spans physical extents 0 and 1, dirblk 1 lies entirely within extent 2 and dirblk 2 spans extents 2 and 3. Because extent 2 is larger than the directory block size, the readahead code erroneously assumes the block is contiguous and issues a readahead based on the physical mapping of the first fsb of the dirblk. This results in read verifier failure and a spurious corruption or crc failure, depending on the filesystem format. Further, the subsequent readahead code responsible for walking through the physical table doesn't correctly advance the physical block reference for dirblk 2. Instead of advancing two physical filesystem blocks, the first iteration of the loop advances 1 block (correctly), but the subsequent iteration advances 2 more physical blocks because the next physical extent (extent 3, above) happens to cover more than dirblk 2. At this point, the higher-level directory block walking is completely off the rails of the actual physical layout of the directory for the respective mapping table. Update the contiguous dirblock logic to consider the current offset in the physical extent to avoid issuing directory readahead to unrelated blocks. Also, update the mapping table advancing code to consider the current offset within the current dirblock to avoid advancing the mapping reference too far beyond the dirblock. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2017-06-07 12:06:02 +02:00
Eric Sandeen	8caa9a54b3	xfs: handle array index overrun in xfs_dir2_leaf_readbuf() commit `023cc840b4` upstream. Carlos had a case where "find" seemed to start spinning forever and never return. This was on a filesystem with non-default multi-fsb (8k) directory blocks, and a fragmented directory with extents like this: 0:[0,133646,2,0] 1:[2,195888,1,0] 2:[3,195890,1,0] 3:[4,195892,1,0] 4:[5,195894,1,0] 5:[6,195896,1,0] 6:[7,195898,1,0] 7:[8,195900,1,0] 8:[9,195902,1,0] 9:[10,195908,1,0] 10:[11,195910,1,0] 11:[12,195912,1,0] 12:[13,195914,1,0] ... i.e. the first extent is a contiguous 2-fsb dir block, but after that it is fragmented into 1 block extents. At the top of the readdir path, we allocate a mapping array which (for this filesystem geometry) can hold 10 extents; see the assignment to map_info->map_size. During readdir, we are therefore able to map extents 0 through 9 above into the array for readahead purposes. If we count by 2, we see that the last mapped index (9) is the first block of a 2-fsb directory block. At the end of xfs_dir2_leaf_readbuf() we have 2 loops to fill more readahead; the outer loop assumes one full dir block is processed each loop iteration, and an inner loop that ensures that this is so by advancing to the next extent until a full directory block is mapped. The problem is that this inner loop may step past the last extent in the mapping array as it tries to reach the end of the directory block. This will read garbage for the extent length, and as a result the loop control variable 'j' may become corrupted and never fail the loop conditional. The number of valid mappings we have in our array is stored in map->map_valid, so stop this inner loop based on that limit. There is an ASSERT at the top of the outer loop for this same condition, but we never made it out of the inner loop, so the ASSERT never fired. Huge appreciation for Carlos for debugging and isolating the problem. Debugged-and-analyzed-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Eric Sandeen <sandeen@redhat.com> Tested-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Bill O'Donnell <billodo@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2017-06-07 12:06:02 +02:00
Darrick J. Wong	0ace12c114	xfs: fix over-copying of getbmap parameters from userspace commit `be6324c00c` upstream. In xfs_ioc_getbmap, we should only copy the fields of struct getbmap from userspace, or else we end up copying random stack contents into the kernel. struct getbmap is a strict subset of getbmapx, so a partial structure copy should work fine. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2017-06-07 12:06:02 +02:00
Eryu Guan	fe705621b9	xfs: fix off-by-one on max nr_pages in xfs_find_get_desired_pgoff() commit `8affebe16d` upstream. xfs_find_get_desired_pgoff() is used to search for offset of hole or data in page range [index, end] (both inclusive), and the max number of pages to search should be at least one, if end == index. Otherwise the only page is missed and no hole or data is found, which is not correct. When block size is smaller than page size, this can be demonstrated by preallocating a file with size smaller than page size and writing data to the last block. E.g. run this xfs_io command on a 1k block size XFS on x86_64 host. # xfs_io -fc "falloc 0 3k" -c "pwrite 2k 1k" \ -c "seek -d 0" /mnt/xfs/testfile wrote 1024/1024 bytes at offset 2048 1 KiB, 1 ops; 0.0000 sec (33.675 MiB/sec and 34482.7586 ops/sec) Whence Result DATA EOF Data at offset 2k was missed, and lseek(2) returned ENXIO. This is uncovered by generic/285 subtest 07 and 08 on ppc64 host, where pagesize is 64k. Because a recent change to generic/285 reduced the preallocated file size to smaller than 64k. Signed-off-by: Eryu Guan <eguan@redhat.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2017-06-07 12:06:01 +02:00
Jan Kara	b9a7816997	xfs: Fix missed holes in SEEK_HOLE implementation commit `5375023ae1` upstream. XFS SEEK_HOLE implementation could miss a hole in an unwritten extent as can be seen by the following command: xfs_io -c "falloc 0 256k" -c "pwrite 0 56k" -c "pwrite 128k 8k" -c "seek -h 0" file wrote 57344/57344 bytes at offset 0 56 KiB, 14 ops; 0.0000 sec (49.312 MiB/sec and 12623.9856 ops/sec) wrote 8192/8192 bytes at offset 131072 8 KiB, 2 ops; 0.0000 sec (70.383 MiB/sec and 18018.0180 ops/sec) Whence Result HOLE 139264 Where we can see that hole at offset 56k was just ignored by SEEK_HOLE implementation. The bug is in xfs_find_get_desired_pgoff() which does not properly detect the case when pages are not contiguous. Fix the problem by properly detecting when found page has larger offset than expected. Fixes: `d126d43f63` Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2017-06-07 12:06:01 +02:00
Yisheng Xie	03489bfc78	mlock: fix mlock count can not decrease in race condition commit `70feee0e1e` upstream. Kefeng reported that when running the follow test, the mlock count in meminfo will increase permanently: [1] testcase linux:~ # cat test_mlockal grep Mlocked /proc/meminfo for j in `seq 0 10` do for i in `seq 4 15` do ./p_mlockall >> log & done sleep 0.2 done # wait some time to let mlock counter decrease and 5s may not enough sleep 5 grep Mlocked /proc/meminfo linux:~ # cat p_mlockall.c #include <sys/mman.h> #include <stdlib.h> #include <stdio.h> #define SPACE_LEN 4096 int main(int argc, char ** argv) { int ret; void *adr = malloc(SPACE_LEN); if (!adr) return -1; ret = mlockall(MCL_CURRENT \| MCL_FUTURE); printf("mlcokall ret = %d\n", ret); ret = munlockall(); printf("munlcokall ret = %d\n", ret); free(adr); return 0; } In __munlock_pagevec() we should decrement NR_MLOCK for each page where we clear the PageMlocked flag. Commit `1ebb7cc6a5` ("mm: munlock: batch NR_MLOCK zone state updates") has introduced a bug where we don't decrement NR_MLOCK for pages where we clear the flag, but fail to isolate them from the lru list (e.g. when the pages are on some other cpu's percpu pagevec). Since PageMlocked stays cleared, the NR_MLOCK accounting gets permanently disrupted by this. Fix it by counting the number of page whose PageMlock flag is cleared. Fixes: `1ebb7cc6a5` (" mm: munlock: batch NR_MLOCK zone state updates") Link: http://lkml.kernel.org/r/1495678405-54569-1-git-send-email-xieyisheng1@huawei.com Signed-off-by: Yisheng Xie <xieyisheng1@huawei.com> Reported-by: Kefeng Wang <wangkefeng.wang@huawei.com> Tested-by: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Joern Engel <joern@logfs.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Michel Lespinasse <walken@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Xishi Qiu <qiuxishi@huawei.com> Cc: zhongjiang <zhongjiang@huawei.com> Cc: Hanjun Guo <guohanjun@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2017-06-07 12:06:01 +02:00
Punit Agrawal	7e13bab109	mm/migrate: fix refcount handling when !hugepage_migration_supported() commit `30809f559a` upstream. On failing to migrate a page, soft_offline_huge_page() performs the necessary update to the hugepage ref-count. But when !hugepage_migration_supported() , unmap_and_move_hugepage() also decrements the page ref-count for the hugepage. The combined behaviour leaves the ref-count in an inconsistent state. This leads to soft lockups when running the overcommitted hugepage test from mce-tests suite. Soft offlining pfn 0x83ed600 at process virtual address 0x400000000000 soft offline: 0x83ed600: migration failed 1, type 1fffc00000008008 (uptodate\|head) INFO: rcu_preempt detected stalls on CPUs/tasks: Tasks blocked on level-0 rcu_node (CPUs 0-7): P2715 (detected by 7, t=5254 jiffies, g=963, c=962, q=321) thugetlb_overco R running task 0 2715 2685 0x00000008 Call trace: dump_backtrace+0x0/0x268 show_stack+0x24/0x30 sched_show_task+0x134/0x180 rcu_print_detail_task_stall_rnp+0x54/0x7c rcu_check_callbacks+0xa74/0xb08 update_process_times+0x34/0x60 tick_sched_handle.isra.7+0x38/0x70 tick_sched_timer+0x4c/0x98 __hrtimer_run_queues+0xc0/0x300 hrtimer_interrupt+0xac/0x228 arch_timer_handler_phys+0x3c/0x50 handle_percpu_devid_irq+0x8c/0x290 generic_handle_irq+0x34/0x50 __handle_domain_irq+0x68/0xc0 gic_handle_irq+0x5c/0xb0 Address this by changing the putback_active_hugepage() in soft_offline_huge_page() to putback_movable_pages(). This only triggers on systems that enable memory failure handling (ARCH_SUPPORTS_MEMORY_FAILURE) but not hugepage migration (!ARCH_ENABLE_HUGEPAGE_MIGRATION). I imagine this wasn't triggered as there aren't many systems running this configuration. [akpm@linux-foundation.org: remove dead comment, per Naoya] Link: http://lkml.kernel.org/r/20170525135146.32011-1-punit.agrawal@arm.com Reported-by: Manoj Iyer <manoj.iyer@canonical.com> Tested-by: Manoj Iyer <manoj.iyer@canonical.com> Suggested-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Punit Agrawal <punit.agrawal@arm.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Wanpeng Li <wanpeng.li@hotmail.com> Cc: Christoph Lameter <cl@linux.com> Cc: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2017-06-07 12:06:01 +02:00

1 2 3 4 5 ...

567546 Commits