Commit Graph

642436 Commits

Author SHA1 Message Date
Greg Kroah-Hartman
bc7ff9b998 Merge 4.9.75 into android-4.9
Changes in 4.9.75
	tcp_bbr: reset full pipe detection on loss recovery undo
	tcp_bbr: reset long-term bandwidth sampling on loss recovery undo
	x86/boot: Add early cmdline parsing for options with arguments
	KAISER: Kernel Address Isolation
	kaiser: merged update
	kaiser: do not set _PAGE_NX on pgd_none
	kaiser: stack map PAGE_SIZE at THREAD_SIZE-PAGE_SIZE
	kaiser: fix build and FIXME in alloc_ldt_struct()
	kaiser: KAISER depends on SMP
	kaiser: fix regs to do_nmi() ifndef CONFIG_KAISER
	kaiser: fix perf crashes
	kaiser: ENOMEM if kaiser_pagetable_walk() NULL
	kaiser: tidied up asm/kaiser.h somewhat
	kaiser: tidied up kaiser_add/remove_mapping slightly
	kaiser: align addition to x86/mm/Makefile
	kaiser: cleanups while trying for gold link
	kaiser: name that 0x1000 KAISER_SHADOW_PGD_OFFSET
	kaiser: delete KAISER_REAL_SWITCH option
	kaiser: vmstat show NR_KAISERTABLE as nr_overhead
	kaiser: enhanced by kernel and user PCIDs
	kaiser: load_new_mm_cr3() let SWITCH_USER_CR3 flush user
	kaiser: PCID 0 for kernel and 128 for user
	kaiser: x86_cr3_pcid_noflush and x86_cr3_pcid_user
	kaiser: paranoid_entry pass cr3 need to paranoid_exit
	kaiser: kaiser_remove_mapping() move along the pgd
	kaiser: fix unlikely error in alloc_ldt_struct()
	kaiser: add "nokaiser" boot option, using ALTERNATIVE
	x86/kaiser: Rename and simplify X86_FEATURE_KAISER handling
	x86/kaiser: Check boottime cmdline params
	kaiser: use ALTERNATIVE instead of x86_cr3_pcid_noflush
	kaiser: drop is_atomic arg to kaiser_pagetable_walk()
	kaiser: asm/tlbflush.h handle noPGE at lower level
	kaiser: kaiser_flush_tlb_on_return_to_user() check PCID
	x86/paravirt: Dont patch flush_tlb_single
	x86/kaiser: Reenable PARAVIRT
	kaiser: disabled on Xen PV
	x86/kaiser: Move feature detection up
	KPTI: Rename to PAGE_TABLE_ISOLATION
	KPTI: Report when enabled
	kaiser: Set _PAGE_NX only if supported
	Linux 4.9.75

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2018-01-05 22:28:25 +01:00
Greg Kroah-Hartman
9f74755895 Linux 4.9.75 2018-01-05 15:46:36 +01:00
Guenter Roeck
92fd81f772 kaiser: Set _PAGE_NX only if supported
This resolves a crash if loaded under qemu + haxm under windows.
See https://www.spinics.net/lists/kernel/msg2689835.html for details.
Here is a boot log (the log is from chromeos-4.4, but Tao Wu says that
the same log is also seen with vanilla v4.4.110-rc1).

[    0.712750] Freeing unused kernel memory: 552K
[    0.721821] init: Corrupted page table at address 57b029b332e0
[    0.722761] PGD 80000000bb238067 PUD bc36a067 PMD bc369067 PTE 45d2067
[    0.722761] Bad pagetable: 000b [#1] PREEMPT SMP 
[    0.722761] Modules linked in:
[    0.722761] CPU: 1 PID: 1 Comm: init Not tainted 4.4.96 #31
[    0.722761] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
rel-1.7.5.1-0-g8936dbb-20141113_115728-nilsson.home.kraxel.org 04/01/2014
[    0.722761] task: ffff8800bc290000 ti: ffff8800bc28c000 task.ti: ffff8800bc28c000
[    0.722761] RIP: 0010:[<ffffffff83f4129e>]  [<ffffffff83f4129e>] __clear_user+0x42/0x67
[    0.722761] RSP: 0000:ffff8800bc28fcf8  EFLAGS: 00010202
[    0.722761] RAX: 0000000000000000 RBX: 00000000000001a4 RCX: 00000000000001a4
[    0.722761] RDX: 0000000000000000 RSI: 0000000000000008 RDI: 000057b029b332e0
[    0.722761] RBP: ffff8800bc28fd08 R08: ffff8800bc290000 R09: ffff8800bb2f4000
[    0.722761] R10: ffff8800bc290000 R11: ffff8800bb2f4000 R12: 000057b029b332e0
[    0.722761] R13: 0000000000000000 R14: 000057b029b33340 R15: ffff8800bb1e2a00
[    0.722761] FS:  0000000000000000(0000) GS:ffff8800bfb00000(0000) knlGS:0000000000000000
[    0.722761] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[    0.722761] CR2: 000057b029b332e0 CR3: 00000000bb2f8000 CR4: 00000000000006e0
[    0.722761] Stack:
[    0.722761]  000057b029b332e0 ffff8800bb95fa80 ffff8800bc28fd18 ffffffff83f4120c
[    0.722761]  ffff8800bc28fe18 ffffffff83e9e7a1 ffff8800bc28fd68 0000000000000000
[    0.722761]  ffff8800bc290000 ffff8800bc290000 ffff8800bc290000 ffff8800bc290000
[    0.722761] Call Trace:
[    0.722761]  [<ffffffff83f4120c>] clear_user+0x2e/0x30
[    0.722761]  [<ffffffff83e9e7a1>] load_elf_binary+0xa7f/0x18f7
[    0.722761]  [<ffffffff83de2088>] search_binary_handler+0x86/0x19c
[    0.722761]  [<ffffffff83de389e>] do_execveat_common.isra.26+0x909/0xf98
[    0.722761]  [<ffffffff844febe0>] ? rest_init+0x87/0x87
[    0.722761]  [<ffffffff83de40be>] do_execve+0x23/0x25
[    0.722761]  [<ffffffff83c002e3>] run_init_process+0x2b/0x2d
[    0.722761]  [<ffffffff844fec4d>] kernel_init+0x6d/0xda
[    0.722761]  [<ffffffff84505b2f>] ret_from_fork+0x3f/0x70
[    0.722761]  [<ffffffff844febe0>] ? rest_init+0x87/0x87
[    0.722761] Code: 86 84 be 12 00 00 00 e8 87 0d e8 ff 66 66 90 48 89 d8 48 c1
eb 03 4c 89 e7 83 e0 07 48 89 d9 be 08 00 00 00 31 d2 48 85 c9 74 0a <48> 89 17
48 01 f7 ff c9 75 f6 48 89 c1 85 c9 74 09 88 17 48 ff 
[    0.722761] RIP  [<ffffffff83f4129e>] __clear_user+0x42/0x67
[    0.722761]  RSP <ffff8800bc28fcf8>
[    0.722761] ---[ end trace def703879b4ff090 ]---
[    0.722761] BUG: sleeping function called from invalid context at /mnt/host/source/src/third_party/kernel/v4.4/kernel/locking/rwsem.c:21
[    0.722761] in_atomic(): 0, irqs_disabled(): 1, pid: 1, name: init
[    0.722761] CPU: 1 PID: 1 Comm: init Tainted: G      D         4.4.96 #31
[    0.722761] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5.1-0-g8936dbb-20141113_115728-nilsson.home.kraxel.org 04/01/2014
[    0.722761]  0000000000000086 dcb5d76098c89836 ffff8800bc28fa30 ffffffff83f34004
[    0.722761]  ffffffff84839dc2 0000000000000015 ffff8800bc28fa40 ffffffff83d57dc9
[    0.722761]  ffff8800bc28fa68 ffffffff83d57e6a ffffffff84a53640 0000000000000000
[    0.722761] Call Trace:
[    0.722761]  [<ffffffff83f34004>] dump_stack+0x4d/0x63
[    0.722761]  [<ffffffff83d57dc9>] ___might_sleep+0x13a/0x13c
[    0.722761]  [<ffffffff83d57e6a>] __might_sleep+0x9f/0xa6
[    0.722761]  [<ffffffff84502788>] down_read+0x20/0x31
[    0.722761]  [<ffffffff83cc5d9b>] __blocking_notifier_call_chain+0x35/0x63
[    0.722761]  [<ffffffff83cc5ddd>] blocking_notifier_call_chain+0x14/0x16
[    0.800374] usb 1-1: new full-speed USB device number 2 using uhci_hcd
[    0.722761]  [<ffffffff83cefe97>] profile_task_exit+0x1a/0x1c
[    0.802309]  [<ffffffff83cac84e>] do_exit+0x39/0xe7f
[    0.802309]  [<ffffffff83ce5938>] ? vprintk_default+0x1d/0x1f
[    0.802309]  [<ffffffff83d7bb95>] ? printk+0x57/0x73
[    0.802309]  [<ffffffff83c46e25>] oops_end+0x80/0x85
[    0.802309]  [<ffffffff83c7b747>] pgtable_bad+0x8a/0x95
[    0.802309]  [<ffffffff83ca7f4a>] __do_page_fault+0x8c/0x352
[    0.802309]  [<ffffffff83eefba5>] ? file_has_perm+0xc4/0xe5
[    0.802309]  [<ffffffff83ca821c>] do_page_fault+0xc/0xe
[    0.802309]  [<ffffffff84507682>] page_fault+0x22/0x30
[    0.802309]  [<ffffffff83f4129e>] ? __clear_user+0x42/0x67
[    0.802309]  [<ffffffff83f4127f>] ? __clear_user+0x23/0x67
[    0.802309]  [<ffffffff83f4120c>] clear_user+0x2e/0x30
[    0.802309]  [<ffffffff83e9e7a1>] load_elf_binary+0xa7f/0x18f7
[    0.802309]  [<ffffffff83de2088>] search_binary_handler+0x86/0x19c
[    0.802309]  [<ffffffff83de389e>] do_execveat_common.isra.26+0x909/0xf98
[    0.802309]  [<ffffffff844febe0>] ? rest_init+0x87/0x87
[    0.802309]  [<ffffffff83de40be>] do_execve+0x23/0x25
[    0.802309]  [<ffffffff83c002e3>] run_init_process+0x2b/0x2d
[    0.802309]  [<ffffffff844fec4d>] kernel_init+0x6d/0xda
[    0.802309]  [<ffffffff84505b2f>] ret_from_fork+0x3f/0x70
[    0.802309]  [<ffffffff844febe0>] ? rest_init+0x87/0x87
[    0.830559] Kernel panic - not syncing: Attempted to kill init!  exitcode=0x00000009
[    0.830559] 
[    0.831305] Kernel Offset: 0x2c00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[    0.831305] ---[ end Kernel panic - not syncing: Attempted to kill init!  exitcode=0x00000009

The crash part of this problem may be solved with the following patch
(thanks to Hugh for the hint). There is still another problem, though -
with this patch applied, the qemu session aborts with "VCPU Shutdown
request", whatever that means.

Cc: lepton <ytht.net@gmail.com>
Signed-off-by: Guenter Roeck <groeck@chromium.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-05 15:46:36 +01:00
Kees Cook
ea6cd39d23 KPTI: Report when enabled
Make sure dmesg reports when KPTI is enabled.

Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-05 15:46:36 +01:00
Kees Cook
e71fac0172 KPTI: Rename to PAGE_TABLE_ISOLATION
This renames CONFIG_KAISER to CONFIG_PAGE_TABLE_ISOLATION.

Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-05 15:46:35 +01:00
Borislav Petkov
59094faf3f x86/kaiser: Move feature detection up
... before the first use of kaiser_enabled as otherwise funky
things happen:

  about to get started...
  (XEN) d0v0 Unhandled page fault fault/trap [#14, ec=0000]
  (XEN) Pagetable walk from ffff88022a449090:
  (XEN)  L4[0x110] = 0000000229e0e067 0000000000001e0e
  (XEN)  L3[0x008] = 0000000000000000 ffffffffffffffff
  (XEN) domain_crash_sync called from entry.S: fault at ffff82d08033fd08
  entry.o#create_bounce_frame+0x135/0x14d
  (XEN) Domain 0 (vcpu#0) crashed on cpu#0:
  (XEN) ----[ Xen-4.9.1_02-3.21  x86_64  debug=n   Not tainted ]----
  (XEN) CPU:    0
  (XEN) RIP:    e033:[<ffffffff81007460>]
  (XEN) RFLAGS: 0000000000000286   EM: 1   CONTEXT: pv guest (d0v0)

Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-05 15:46:35 +01:00
Jiri Kosina
402e63de94 kaiser: disabled on Xen PV
Kaiser cannot be used on paravirtualized MMUs (namely reading and writing CR3).
This does not work with KAISER as the CR3 switch from and to user space PGD
would require to map the whole XEN_PV machinery into both.

More importantly, enabling KAISER on Xen PV doesn't make too much sense, as PV
guests use distinct %cr3 values for kernel and user already.

Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-05 15:46:35 +01:00
Borislav Petkov
2c2721754a x86/kaiser: Reenable PARAVIRT
Now that the required bits have been addressed, reenable
PARAVIRT.

Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-05 15:46:35 +01:00
Thomas Gleixner
1817d2c2fa x86/paravirt: Dont patch flush_tlb_single
commit a035795499 upstream

native_flush_tlb_single() will be changed with the upcoming
PAGE_TABLE_ISOLATION feature. This requires to have more code in
there than INVLPG.

Remove the paravirt patching for it.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Borislav Petkov <bpetkov@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Eduardo Valentin <eduval@amazon.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: aliguori@amazon.com
Cc: daniel.gruss@iaik.tugraz.at
Cc: hughd@google.com
Cc: keescook@google.com
Cc: linux-mm@kvack.org
Cc: michael.schwarz@iaik.tugraz.at
Cc: moritz.lipp@iaik.tugraz.at
Cc: richard.fellner@student.tugraz.at
Link: https://lkml.kernel.org/r/20171204150606.828111617@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-05 15:46:35 +01:00
Hugh Dickins
fe5cb75fd2 kaiser: kaiser_flush_tlb_on_return_to_user() check PCID
Let kaiser_flush_tlb_on_return_to_user() do the X86_FEATURE_PCID
check, instead of each caller doing it inline first: nobody needs
to optimize for the noPCID case, it's clearer this way, and better
suits later changes.  Replace those no-op X86_CR3_PCID_KERN_FLUSH lines
by a BUILD_BUG_ON() in load_new_mm_cr3(), in case something changes.

Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-05 15:46:35 +01:00
Hugh Dickins
b72c26e911 kaiser: asm/tlbflush.h handle noPGE at lower level
I found asm/tlbflush.h too twisty, and think it safer not to avoid
__native_flush_tlb_global_irq_disabled() in the kaiser_enabled case,
but instead let it handle kaiser_enabled along with cr3: it can just
use __native_flush_tlb() for that, no harm in re-disabling preemption.

(This is not the same change as Kirill and Dave have suggested for
upstream, flipping PGE in cr4: that's neat, but needs a cpu_has_pge
check; cr3 is enough for kaiser, and thought to be cheaper than cr4.)

Also delete the X86_FEATURE_INVPCID invpcid_flush_all_nonglobals()
preference from __native_flush_tlb(): unlike the invpcid_flush_all()
preference in __native_flush_tlb_global(), it's not seen in upstream
4.14, and was recently reported to be surprisingly slow.

Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-05 15:46:35 +01:00
Hugh Dickins
8c2f8a5cc1 kaiser: drop is_atomic arg to kaiser_pagetable_walk()
I have not observed a might_sleep() warning from setup_fixmap_gdt()'s
use of kaiser_add_mapping() in our tree (why not?), but like upstream
we have not provided a way for that to pass is_atomic true down to
kaiser_pagetable_walk(), and at startup it's far from a likely source
of trouble: so just delete the walk's is_atomic arg and might_sleep().

Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-05 15:46:35 +01:00
Hugh Dickins
169b369f99 kaiser: use ALTERNATIVE instead of x86_cr3_pcid_noflush
Now that we're playing the ALTERNATIVE game, use that more efficient
method: instead of user-mapping an extra page, and reading an extra
cacheline each time for x86_cr3_pcid_noflush.

Neel has found that __stringify(bts $X86_CR3_PCID_NOFLUSH_BIT, %rax)
is a working substitute for the "bts $63, %rax" in these ALTERNATIVEs;
but the one line with $63 in looks clearer, so let's stick with that.

Worried about what happens with an ALTERNATIVE between the jump and
jump label in another ALTERNATIVE?  I was, but have checked the
combinations in SWITCH_KERNEL_CR3_NO_STACK at entry_SYSCALL_64,
and it does a good job.

Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-05 15:46:35 +01:00
Borislav Petkov
8018307a45 x86/kaiser: Check boottime cmdline params
AMD (and possibly other vendors) are not affected by the leak
KAISER is protecting against.

Keep the "nopti" for traditional reasons and add pti=<on|off|auto>
like upstream.

Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-05 15:46:35 +01:00
Borislav Petkov
50624dd12d x86/kaiser: Rename and simplify X86_FEATURE_KAISER handling
Concentrate it in arch/x86/mm/kaiser.c and use the upstream string "nopti".

Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-05 15:46:34 +01:00
Hugh Dickins
23e09439aa kaiser: add "nokaiser" boot option, using ALTERNATIVE
Added "nokaiser" boot option: an early param like "noinvpcid".
Most places now check int kaiser_enabled (#defined 0 when not
CONFIG_KAISER) instead of #ifdef CONFIG_KAISER; but entry_64.S
and entry_64_compat.S are using the ALTERNATIVE technique, which
patches in the preferred instructions at runtime.  That technique
is tied to x86 cpu features, so X86_FEATURE_KAISER is fabricated.

Prior to "nokaiser", Kaiser #defined _PAGE_GLOBAL 0: revert that,
but be careful with both _PAGE_GLOBAL and CR4.PGE: setting them when
nokaiser like when !CONFIG_KAISER, but not setting either when kaiser -
neither matters on its own, but it's hard to be sure that _PAGE_GLOBAL
won't get set in some obscure corner, or something add PGE into CR4.
By omitting _PAGE_GLOBAL from __supported_pte_mask when kaiser_enabled,
all page table setup which uses pte_pfn() masks it out of the ptes.

It's slightly shameful that the same declaration versus definition of
kaiser_enabled appears in not one, not two, but in three header files
(asm/kaiser.h, asm/pgtable.h, asm/tlbflush.h).  I felt safer that way,
than with #including any of those in any of the others; and did not
feel it worth an asm/kaiser_enabled.h - kernel/cpu/common.c includes
them all, so we shall hear about it if they get out of synch.

Cleanups while in the area: removed the silly #ifdef CONFIG_KAISER
from kaiser.c; removed the unused native_get_normal_pgd(); removed
the spurious reg clutter from SWITCH_*_CR3 macro stubs; corrected some
comments.  But more interestingly, set CR4.PSE in secondary_startup_64:
the manual is clear that it does not matter whether it's 0 or 1 when
4-level-pts are enabled, but I was distracted to find cr4 different on
BSP and auxiliaries - BSP alone was adding PSE, in probe_page_size_mask().

Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-05 15:46:34 +01:00
Hugh Dickins
cb7d8d7e67 kaiser: fix unlikely error in alloc_ldt_struct()
An error from kaiser_add_mapping() here is not at all likely, but
Eric Biggers rightly points out that __free_ldt_struct() relies on
new_ldt->size being initialized: move that up.

Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-05 15:46:34 +01:00
Hugh Dickins
3df1461787 kaiser: kaiser_remove_mapping() move along the pgd
When removing the bogus comment from kaiser_remove_mapping(),
I really ought to have checked the extent of its bogosity: as
Neel points out, there is nothing to stop unmap_pud_range_nofree()
from continuing beyond the end of a pud (and starting in the wrong
position on the next).

Fix kaiser_remove_mapping() to constrain the extent and advance pgd
pointer correctly: use pgd_addr_end() macro as used throughout base
mm (but don't assume page-rounded start and size in this case).

But this bug was very unlikely to trigger in this backport: since
any buddy allocation is contained within a single pud extent, and
we are not using vmapped stacks (and are only mapping one page of
stack anyway): the only way to hit this bug here would be when
freeing a large modified ldt.

Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-05 15:46:34 +01:00
Hugh Dickins
05ddad146d kaiser: paranoid_entry pass cr3 need to paranoid_exit
Neel Natu points out that paranoid_entry() was wrong to assume that
an entry that did not need swapgs would not need SWITCH_KERNEL_CR3:
paranoid_entry (used for debug breakpoint, int3, double fault or MCE;
though I think it's only the MCE case that is cause for concern here)
can break in at an awkward time, between cr3 switch and swapgs, but
its handling always needs kernel gs and kernel cr3.

Easy to fix in itself, but paranoid_entry() also needs to convey to
paranoid_exit() (and my reading of macro idtentry says paranoid_entry
and paranoid_exit are always paired) how to restore the prior state.
The swapgs state is already conveyed by %ebx (0 or 1), so extend that
also to convey when SWITCH_USER_CR3 will be needed (2 or 3).

(Yes, I'd much prefer that 0 meant no swapgs, whereas it's the other
way round: and a convention shared with error_entry() and error_exit(),
which I don't want to touch.  Perhaps I should have inverted the bit
for switch cr3 too, but did not.)

paranoid_exit() would be straightforward, except for TRACE_IRQS: it
did TRACE_IRQS_IRETQ when doing swapgs, but TRACE_IRQS_IRETQ_DEBUG
when not: which is it supposed to use when SWITCH_USER_CR3 is split
apart from that?  As best as I can determine, commit 5963e317b1
("ftrace/x86: Do not change stacks in DEBUG when calling lockdep")
missed the swapgs case, and should have used TRACE_IRQS_IRETQ_DEBUG
there too (the discrepancy has nothing to do with the liberal use
of _NO_STACK and _UNSAFE_STACK hereabouts: TRACE_IRQS_OFF_DEBUG has
just been used in all cases); discrepancy lovingly preserved across
several paranoid_exit() cleanups, but I'm now removing it.

Neel further indicates that to use SWITCH_USER_CR3_NO_STACK there in
paranoid_exit() is now not only unnecessary but unsafe: might corrupt
syscall entry's unsafe_stack_register_backup of %rax.  Just use
SWITCH_USER_CR3: and delete SWITCH_USER_CR3_NO_STACK altogether,
before we make the mistake of using it again.

hughd adds: this commit fixes an issue in the Kaiser-without-PCIDs
part of the series, and ought to be moved earlier, if you decided
to make a release of Kaiser-without-PCIDs.

Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-05 15:46:34 +01:00
Hugh Dickins
d0142ceb79 kaiser: x86_cr3_pcid_noflush and x86_cr3_pcid_user
Mostly this commit is just unshouting X86_CR3_PCID_KERN_VAR and
X86_CR3_PCID_USER_VAR: we usually name variables in lower-case.

But why does x86_cr3_pcid_noflush need to be __aligned(PAGE_SIZE)?
Ah, it's a leftover from when kaiser_add_user_map() once complained
about mapping the same page twice.  Make it __read_mostly instead.
(I'm a little uneasy about all the unrelated data which shares its
page getting user-mapped too, but that was so before, and not a big
deal: though we call it user-mapped, it's not mapped with _PAGE_USER.)

And there is a little change around the two calls to do_nmi().
Previously they set the NOFLUSH bit (if PCID supported) when
forcing to kernel context before do_nmi(); now they also have the
NOFLUSH bit set (if PCID supported) when restoring context after:
nothing done in do_nmi() should require a TLB to be flushed here.

Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-05 15:46:34 +01:00
Hugh Dickins
6a2b411761 kaiser: PCID 0 for kernel and 128 for user
Why was 4 chosen for kernel PCID and 6 for user PCID?
No good reason in a backport where PCIDs are only used for Kaiser.

If we continue with those, then we shall need to add Andy Lutomirski's
4.13 commit 6c690ee103 ("x86/mm: Split read_cr3() into read_cr3_pa()
and __read_cr3()"), which deals with the problem of read_cr3() callers
finding stray bits in the cr3 that they expected to be page-aligned;
and for hibernation, his 4.14 commit f34902c5c6 ("x86/hibernate/64:
Mask off CR3's PCID bits in the saved CR3").

But if 0 is used for kernel PCID, then there's no need to add in those
commits - whenever the kernel looks, it sees 0 in the lower bits; and
0 for kernel seems an obvious choice.

And I naughtily propose 128 for user PCID.  Because there's a place
in _SWITCH_TO_USER_CR3 where it takes note of the need for TLB FLUSH,
but needs to reset that to NOFLUSH for the next occasion.  Currently
it does so with a "movb $(0x80)" into the high byte of the per-cpu
quadword, but that will cause a machine without PCID support to crash.
Now, if %al just happened to have 0x80 in it at that point, on a
machine with PCID support, but 0 on a machine without PCID support...

(That will go badly wrong once the pgd can be at a physical address
above 2^56, but even with 5-level paging, physical goes up to 2^52.)

Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-05 15:46:34 +01:00
Hugh Dickins
0b5ca9d995 kaiser: load_new_mm_cr3() let SWITCH_USER_CR3 flush user
We have many machines (Westmere, Sandybridge, Ivybridge) supporting
PCID but not INVPCID: on these load_new_mm_cr3() simply crashed.

Flushing user context inside load_new_mm_cr3() without the use of
invpcid is difficult: momentarily switch from kernel to user context
and back to do so?  I'm not sure whether that can be safely done at
all, and would risk polluting user context with kernel internals,
and kernel context with stale user externals.

Instead, follow the hint in the comment that was there: change
X86_CR3_PCID_USER_VAR to be a per-cpu variable, then load_new_mm_cr3()
can leave a note in it, for SWITCH_USER_CR3 on return to userspace to
flush user context TLB, instead of default X86_CR3_PCID_USER_NOFLUSH.

Which works well enough that there's no need to do it this way only
when invpcid is unsupported: it's a good alternative to invpcid here.
But there's a couple of inlines in asm/tlbflush.h that need to do the
same trick, so it's best to localize all this per-cpu business in
mm/kaiser.c: moving that part of the initialization from setup_pcid()
to kaiser_setup_pcid(); with kaiser_flush_tlb_on_return_to_user() the
function for noting an X86_CR3_PCID_USER_FLUSH.  And let's keep a
KAISER_SHADOW_PGD_OFFSET in there, to avoid the extra OR on exit.

I did try to make the feature tests in asm/tlbflush.h more consistent
with each other: there seem to be far too many ways of performing such
tests, and I don't have a good grasp of their differences.  At first
I converted them all to be static_cpu_has(): but that proved to be a
mistake, as the comment in __native_flush_tlb_single() hints; so then
I reversed and made them all this_cpu_has().  Probably all gratuitous
change, but that's the way it's working at present.

I am slightly bothered by the way non-per-cpu X86_CR3_PCID_KERN_VAR
gets re-initialized by each cpu (before and after these changes):
no problem when (as usual) all cpus on a machine have the same
features, but in principle incorrect.  However, my experiment
to per-cpu-ify that one did not end well...

Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-05 15:46:34 +01:00
Hugh Dickins
2684b12a16 kaiser: enhanced by kernel and user PCIDs
Merged performance improvements to Kaiser, using distinct kernel
and user Process Context Identifiers to minimize the TLB flushing.

[This work actually all from Dave Hansen 2017-08-30:
still omitting trackswitch mods, and KAISER_REAL_SWITCH deleted.]

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-05 15:46:34 +01:00
Hugh Dickins
1972bb9d92 kaiser: vmstat show NR_KAISERTABLE as nr_overhead
The kaiser update made an interesting choice, never to free any shadow
page tables.  Contention on global spinlock was worrying, particularly
with it held across page table scans when freeing.  Something had to be
done: I was going to add refcounting; but simply never to free them is
an appealing choice, minimizing contention without complicating the code
(the more a page table is found already, the less the spinlock is used).

But leaking pages in this way is also a worry: can we get away with it?
At the very least, we need a count to show how bad it actually gets:
in principle, one might end up wasting about 1/256 of memory that way
(1/512 for when direct-mapped pages have to be user-mapped, plus 1/512
for when they are user-mapped from the vmalloc area on another occasion
(but we don't have vmalloc'ed stacks, so only large ldts are vmalloc'ed).

Add per-cpu stat NR_KAISERTABLE: including 256 at startup for the
shared pgd entries, and 1 for each intermediate page table added
thereafter for user-mapping - but leave out the 1 per mm, for its
shadow pgd, because that distracts from the monotonic increase.
Shown in /proc/vmstat as nr_overhead (0 if kaiser not enabled).

In practice, it doesn't look so bad so far: more like 1/12000 after
nine hours of gtests below; and movable pageblock segregation should
tend to cluster the kaiser tables into a subset of the address space
(if not, they will be bad for compaction too).  But production may
tell a different story: keep an eye on this number, and bring back
lighter freeing if it gets out of control (maybe a shrinker).

["nr_overhead" should of course say "nr_kaisertable", if it needs
to stay; but for the moment we are being coy, preferring that when
Joe Blow notices a new line in his /proc/vmstat, he does not get
too curious about what this "kaiser" stuff might be.]

Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-05 15:46:33 +01:00
Hugh Dickins
1ce27de401 kaiser: delete KAISER_REAL_SWITCH option
We fail to see what CONFIG_KAISER_REAL_SWITCH is for: it seems to be
left over from early development, and now just obscures tricky parts
of the code.  Delete it before adding PCIDs, or nokaiser boot option.

(Or if there is some good reason to keep the option, then it needs
a help text - and a "depends on KAISER", so that all those without
KAISER are not asked the question.  But we'd much rather delete it.)

Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-05 15:46:33 +01:00
Hugh Dickins
c27cdea56c kaiser: name that 0x1000 KAISER_SHADOW_PGD_OFFSET
There's a 0x1000 in various places, which looks better with a name.

Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-05 15:46:33 +01:00
Hugh Dickins
61b7a404fa kaiser: cleanups while trying for gold link
While trying to get our gold link to work, four cleanups:
matched the gdt_page declaration to its definition;
in fiddling unsuccessfully with PERCPU_INPUT(), lined up backslashes;
lined up the backslashes according to convention in percpu-defs.h;
deleted the unused irq_stack_pointer addition to irq_stack_union.

Sad to report that aligning backslashes does not appear to help gold
align to 8192: but while these did not help, they are worth keeping.

Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-05 15:46:33 +01:00
Hugh Dickins
604db49610 kaiser: align addition to x86/mm/Makefile
Use tab not space so they line up properly, kaslr.o also.

Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-05 15:46:33 +01:00
Hugh Dickins
be6bf01f4c kaiser: tidied up kaiser_add/remove_mapping slightly
Yes, unmap_pud_range_nofree()'s declaration ought to be in a
header file really, but I'm not sure we want to use it anyway:
so for now just declare it inside kaiser_remove_mapping().
And there doesn't seem to be such a thing as unmap_p4d_range(),
even in a 5-level paging tree.

Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-05 15:46:33 +01:00
Hugh Dickins
67fab0d4ac kaiser: tidied up asm/kaiser.h somewhat
Mainly deleting a surfeit of blank lines, and reflowing header comment.

Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-05 15:46:33 +01:00
Hugh Dickins
f43f386f0b kaiser: ENOMEM if kaiser_pagetable_walk() NULL
kaiser_add_user_map() took no notice when kaiser_pagetable_walk() failed.
And avoid its might_sleep() when atomic (though atomic at present unused).

Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-05 15:46:33 +01:00
Hugh Dickins
f881e62684 kaiser: fix perf crashes
Avoid perf crashes: place debug_store in the user-mapped per-cpu area
instead of allocating, and use page allocator plus kaiser_add_mapping()
to keep the BTS and PEBS buffers user-mapped (that is, present in the
user mapping, though visible only to kernel and hardware).  The PEBS
fixup buffer does not need this treatment.

The need for a user-mapped struct debug_store showed up before doing
any conscious perf testing: in a couple of kernel paging oopses on
Westmere, implicating the debug_store offset of the per-cpu area.

Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-05 15:46:33 +01:00
Hugh Dickins
1937794431 kaiser: fix regs to do_nmi() ifndef CONFIG_KAISER
pjt has observed that nmi's second (nmi_from_kernel) call to do_nmi()
adjusted the %rdi regs arg, rightly when CONFIG_KAISER, but wrongly
when not CONFIG_KAISER.

Although the minimal change is to add an #ifdef CONFIG_KAISER around
the addq line, that looks cluttered, and I prefer how the first call
to do_nmi() handled it: prepare args in %rdi and %rsi before getting
into the CONFIG_KAISER block, since it does not touch them at all.

And while we're here, place the "#ifdef CONFIG_KAISER" that follows
each, to enclose the "Unconditionally restore CR3" comment: matching
how the "Unconditionally use kernel CR3" comment above is enclosed.

Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-05 15:46:32 +01:00
Hugh Dickins
639c005dae kaiser: KAISER depends on SMP
It is absurd that KAISER should depend on SMP, but apparently nobody
has tried a UP build before: which breaks on implicit declaration of
function 'per_cpu_offset' in arch/x86/mm/kaiser.c.

Now, you would expect that to be trivially fixed up; but looking at
the System.map when that block is #ifdef'ed out of kaiser_init(),
I see that in a UP build __per_cpu_user_mapped_end is precisely at
__per_cpu_user_mapped_start, and the items carefully gathered into
that section for user-mapping on SMP, dispersed elsewhere on UP.

So, some other kind of section assignment will be needed on UP,
but implementing that is not a priority: just make KAISER depend
on SMP for now.

Also inserted a blank line before the option, tidied up the
brief Kconfig help message, and added an "If unsure, Y".

Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-05 15:46:32 +01:00
Hugh Dickins
7a92e20d15 kaiser: fix build and FIXME in alloc_ldt_struct()
Include linux/kaiser.h instead of asm/kaiser.h to build ldt.c without
CONFIG_KAISER.  kaiser_add_mapping() does already return an error code,
so fix the FIXME.

Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-05 15:46:32 +01:00
Hugh Dickins
0994a2cf8f kaiser: stack map PAGE_SIZE at THREAD_SIZE-PAGE_SIZE
Kaiser only needs to map one page of the stack; and
kernel/fork.c did not build on powerpc (no __PAGE_KERNEL).
It's all cleaner if linux/kaiser.h provides kaiser_map_thread_stack()
and kaiser_unmap_thread_stack() wrappers around asm/kaiser.h's
kaiser_add_mapping() and kaiser_remove_mapping().  And use
linux/kaiser.h in init/main.c to avoid the #ifdefs there.

Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-05 15:46:32 +01:00
Hugh Dickins
ac2f1018ac kaiser: do not set _PAGE_NX on pgd_none
native_pgd_clear() uses native_set_pgd(), so native_set_pgd() must
avoid setting the _PAGE_NX bit on an otherwise pgd_none() entry:
usually that just generated a warning on exit, but sometimes
more mysterious and damaging failures (our production machines
could not complete booting).

The original fix to this just avoided adding _PAGE_NX to
an empty entry; but eventually more problems surfaced with kexec,
and EFI mapping expected to be a problem too.  So now instead
change native_set_pgd() to update shadow only if _PAGE_USER:

A few places (kernel/machine_kexec_64.c, platform/efi/efi_64.c for sure)
use set_pgd() to set up a temporary internal virtual address space, with
physical pages remapped at what Kaiser regards as userspace addresses:
Kaiser then assumes a shadow pgd follows, which it will try to corrupt.

This appears to be responsible for the recent kexec and kdump failures;
though it's unclear how those did not manifest as a problem before.
Ah, the shadow pgd will only be assumed to "follow" if the requested
pgd is on an even-numbered page: so I suppose it was going wrong 50%
of the time all along.

What we need is a flag to set_pgd(), to tell it we're dealing with
userspace.  Er, isn't that what the pgd's _PAGE_USER bit is saying?
Add a test for that.  But we cannot do the same for pgd_clear()
(which may be called to clear corrupted entries - set aside the
question of "corrupt in which pgd?" until later), so there just
rely on pgd_clear() not being called in the problematic cases -
with a WARN_ON_ONCE() which should fire half the time if it is.

But this is getting too big for an inline function: move it into
arch/x86/mm/kaiser.c (which then demands a boot/compressed mod);
and de-void and de-space native_get_shadow/normal_pgd() while here.

Also make an unnecessary change to KASLR's init_trampoline(): it was
using set_pgd() to assign a pgd-value to a global variable (not in a
pg directory page), which was rather scary given Kaiser's previous
set_pgd() implementation: not a problem now, but too scary to leave
as was, it could easily blow up if we have to change set_pgd() again.

Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-05 15:46:32 +01:00
Dave Hansen
8f0baadf2b kaiser: merged update
Merged fixes and cleanups, rebased to 4.9.51 tree (no 5-level paging).

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-05 15:46:32 +01:00
Richard Fellner
13be4483bb KAISER: Kernel Address Isolation
This patch introduces our implementation of KAISER (Kernel Address Isolation to
have Side-channels Efficiently Removed), a kernel isolation technique to close
hardware side channels on kernel address information.

More information about the patch can be found on:

        https://github.com/IAIK/KAISER

From: Richard Fellner <richard.fellner@student.tugraz.at>
From: Daniel Gruss <daniel.gruss@iaik.tugraz.at>
Subject: [RFC, PATCH] x86_64: KAISER - do not map kernel in user mode
Date: Thu, 4 May 2017 14:26:50 +0200
Link: http://marc.info/?l=linux-kernel&m=149390087310405&w=2
Kaiser-4.10-SHA1: c4b1831d44c6144d3762ccc72f0c4e71a0c713e5

To: <linux-kernel@vger.kernel.org>
To: <kernel-hardening@lists.openwall.com>
Cc: <clementine.maurice@iaik.tugraz.at>
Cc: <moritz.lipp@iaik.tugraz.at>
Cc: Michael Schwarz <michael.schwarz@iaik.tugraz.at>
Cc: Richard Fellner <richard.fellner@student.tugraz.at>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: <kirill.shutemov@linux.intel.com>
Cc: <anders.fogh@gdata-adan.de>

After several recent works [1,2,3] KASLR on x86_64 was basically
considered dead by many researchers. We have been working on an
efficient but effective fix for this problem and found that not mapping
the kernel space when running in user mode is the solution to this
problem [4] (the corresponding paper [5] will be presented at ESSoS17).

With this RFC patch we allow anybody to configure their kernel with the
flag CONFIG_KAISER to add our defense mechanism.

If there are any questions we would love to answer them.
We also appreciate any comments!

Cheers,
Daniel (+ the KAISER team from Graz University of Technology)

[1] http://www.ieee-security.org/TC/SP2013/papers/4977a191.pdf
[2] https://www.blackhat.com/docs/us-16/materials/us-16-Fogh-Using-Undocumented-CPU-Behaviour-To-See-Into-Kernel-Mode-And-Break-KASLR-In-The-Process.pdf
[3] https://www.blackhat.com/docs/us-16/materials/us-16-Jang-Breaking-Kernel-Address-Space-Layout-Randomization-KASLR-With-Intel-TSX.pdf
[4] https://github.com/IAIK/KAISER
[5] https://gruss.cc/files/kaiser.pdf

[patch based also on
https://raw.githubusercontent.com/IAIK/KAISER/master/KAISER/0001-KAISER-Kernel-Address-Isolation.patch]

Signed-off-by: Richard Fellner <richard.fellner@student.tugraz.at>
Signed-off-by: Moritz Lipp <moritz.lipp@iaik.tugraz.at>
Signed-off-by: Daniel Gruss <daniel.gruss@iaik.tugraz.at>
Signed-off-by: Michael Schwarz <michael.schwarz@iaik.tugraz.at>
Acked-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-05 15:46:32 +01:00
Tom Lendacky
b5fd58e997 x86/boot: Add early cmdline parsing for options with arguments
commit e505371dd8 upstream.

Add a cmdline_find_option() function to look for cmdline options that
take arguments. The argument is returned in a supplied buffer and the
argument length (regardless of whether it fits in the supplied buffer)
is returned, with -1 indicating not found.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brijesh Singh <brijesh.singh@amd.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Larry Woodman <lwoodman@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Toshimitsu Kani <toshi.kani@hpe.com>
Cc: kasan-dev@googlegroups.com
Cc: kvm@vger.kernel.org
Cc: linux-arch@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-efi@vger.kernel.org
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/36b5f97492a9745dce27682305f990fc20e5cf8a.1500319216.git.thomas.lendacky@amd.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-05 15:46:32 +01:00
Neal Cardwell
8824b2d7ab tcp_bbr: reset long-term bandwidth sampling on loss recovery undo
commit 600647d467 upstream.

Fix BBR so that upon notification of a loss recovery undo BBR resets
long-term bandwidth sampling.

Under high reordering, reordering events can be interpreted as loss.
If the reordering and spurious loss estimates are high enough, this
can cause BBR to spuriously estimate that we are seeing loss rates
high enough to trigger long-term bandwidth estimation. To avoid that
problem, this commit resets long-term bandwidth sampling on loss
recovery undo events.

Signed-off-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-05 15:46:31 +01:00
Neal Cardwell
61c51da2b4 tcp_bbr: reset full pipe detection on loss recovery undo
commit 2f6c498e4f upstream.

Fix BBR so that upon notification of a loss recovery undo BBR resets
the full pipe detection (STARTUP exit) state machine.

Under high reordering, reordering events can be interpreted as loss.
If the reordering and spurious loss estimates are high enough, this
could previously cause BBR to spuriously estimate that the pipe is
full.

Since spurious loss recovery means that our overall sending will have
slowed down spuriously, this commit gives a flow more time to probe
robustly for bandwidth and decide the pipe is really full.

Signed-off-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-05 15:46:31 +01:00
Daniel Rosenberg
173c52eae9 ANDROID: sdcardfs: Add default_normal option
The default_normal option causes mounts with the gid set to
AID_SDCARD_RW to have user specific gids, as in the normal case.

Signed-off-by: Daniel Rosenberg <drosen@google.com>
Change-Id: I9619b8ac55f41415df943484dc8db1ea986cef6f
Bug: 64672411
2018-01-04 00:32:00 +00:00
Daniel Rosenberg
f18c44dc55 ANDROID: sdcardfs: notify lower file of opens
fsnotify_open is not called within dentry_open,
so we need to call it ourselves.

Change-Id: Ia7f323b3d615e6ca5574e114e8a5d7973fb4c119
Signed-off-by: Daniel Rosenberg <drosen@google.com>
Bug: 70706497
2018-01-04 00:17:37 +00:00
Hyojun Kim
11537d5ac9 blkdev: Refactoring block io latency histogram codes
The current io_latency_state structure includes entries for read
and write requests. There are special types of write commands
such as sync and discard (trim) commands, and the current
implementation is not general enough if we want to separate
latency histogram for such special commands.

This change makes io_latency_state structure request-type neutral.
It also changes to print the latency average.

Signed-off-by: Hyojun Kim <hyojun@google.com>
2018-01-03 21:16:40 +00:00
Vincent Guittot
4406671f9e UPSTREAM: netfilter: conntrack: use power efficient workqueue
conntrack uses the bounded system_long_wq workqueue for its works that
don't have to run on the cpu they have been queued.
Using bounded workqueue prevents the scheduler to make smart decision about
the best place to schedule the work.

This patch replaces system_long_wq with system_power_efficient_wq. the work
stays bounded to a cpu by default unless the CONFIG_WQ_POWER_EFFICIENT is
enable. In the latter case, the work can be scheduled on the best cpu from
a power or a performance point of view.

Change-Id: I12f862e7476b8e77c50e989ce16bd49b52063c1a
(cherry picked from commit 0984d427c1)
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
2018-01-03 13:40:30 +05:30
Greg Kroah-Hartman
9e5dd8ed9b Merge 4.9.74 into android-4.9
Changes in 4.9.74
	sync objtool's copy of x86-opcode-map.txt
	tracing: Remove extra zeroing out of the ring buffer page
	tracing: Fix possible double free on failure of allocating trace buffer
	tracing: Fix crash when it fails to alloc ring buffer
	ring-buffer: Mask out the info bits when returning buffer page length
	iw_cxgb4: Only validate the MSN for successful completions
	ASoC: wm_adsp: Fix validation of firmware and coeff lengths
	ASoC: da7218: fix fix child-node lookup
	ASoC: fsl_ssi: AC'97 ops need regmap, clock and cleaning up on failure
	ASoC: twl4030: fix child-node lookup
	ASoC: tlv320aic31xx: Fix GPIO1 register definition
	ALSA: hda: Drop useless WARN_ON()
	ALSA: hda - fix headset mic detection issue on a Dell machine
	x86/vm86/32: Switch to flush_tlb_mm_range() in mark_screen_rdonly()
	x86/mm: Remove flush_tlb() and flush_tlb_current_task()
	x86/mm: Make flush_tlb_mm_range() more predictable
	x86/mm: Reimplement flush_tlb_page() using flush_tlb_mm_range()
	x86/mm: Remove the UP asm/tlbflush.h code, always use the (formerly) SMP code
	x86/mm: Disable PCID on 32-bit kernels
	x86/mm: Add the 'nopcid' boot option to turn off PCID
	x86/mm: Enable CR4.PCIDE on supported systems
	x86/mm/64: Fix reboot interaction with CR4.PCIDE
	kbuild: add '-fno-stack-check' to kernel build options
	ipv4: igmp: guard against silly MTU values
	ipv6: mcast: better catch silly mtu values
	net: fec: unmap the xmit buffer that are not transferred by DMA
	net: igmp: Use correct source address on IGMPv3 reports
	netlink: Add netns check on taps
	net: qmi_wwan: add Sierra EM7565 1199:9091
	net: reevalulate autoflowlabel setting after sysctl setting
	ptr_ring: add barriers
	RDS: Check cmsg_len before dereferencing CMSG_DATA
	tcp_bbr: record "full bw reached" decision in new full_bw_reached bit
	tcp md5sig: Use skb's saddr when replying to an incoming segment
	tg3: Fix rx hang on MTU change with 5717/5719
	net: ipv4: fix for a race condition in raw_sendmsg
	net: mvmdio: disable/unprepare clocks in EPROBE_DEFER case
	sctp: Replace use of sockets_allocated with specified macro.
	adding missing rcu_read_unlock in ipxip6_rcv
	ipv4: Fix use-after-free when flushing FIB tables
	net: bridge: fix early call to br_stp_change_bridge_id and plug newlink leaks
	net: fec: Allow reception of frames bigger than 1522 bytes
	net: Fix double free and memory corruption in get_net_ns_by_id()
	net: phy: micrel: ksz9031: reconfigure autoneg after phy autoneg workaround
	sock: free skb in skb_complete_tx_timestamp on error
	tcp: invalidate rate samples during SACK reneging
	net/mlx5: Fix rate limit packet pacing naming and struct
	net/mlx5e: Fix features check of IPv6 traffic
	net/mlx5e: Fix possible deadlock of VXLAN lock
	net/mlx5e: Add refcount to VXLAN structure
	net/mlx5e: Prevent possible races in VXLAN control flow
	net/mlx5: Fix error flow in CREATE_QP command
	s390/qeth: apply takeover changes when mode is toggled
	s390/qeth: don't apply takeover changes to RXIP
	s390/qeth: lock IP table while applying takeover changes
	s390/qeth: update takeover IPs after configuration change
	usbip: fix usbip bind writing random string after command in match_busid
	usbip: prevent leaking socket pointer address in messages
	usbip: stub: stop printing kernel pointer addresses in messages
	usbip: vhci: stop printing kernel pointer addresses in messages
	USB: serial: ftdi_sio: add id for Airbus DS P8GR
	USB: serial: qcserial: add Sierra Wireless EM7565
	USB: serial: option: add support for Telit ME910 PID 0x1101
	USB: serial: option: adding support for YUGA CLM920-NC5
	usb: Add device quirk for Logitech HD Pro Webcam C925e
	usb: add RESET_RESUME for ELSA MicroLink 56K
	USB: Fix off by one in type-specific length check of BOS SSP capability
	usb: xhci: Add XHCI_TRUST_TX_LENGTH for Renesas uPD720201
	timers: Use deferrable base independent of base::nohz_active
	timers: Invoke timer_start_debug() where it makes sense
	timers: Reinitialize per cpu bases on hotplug
	nohz: Prevent a timer interrupt storm in tick_nohz_stop_sched_tick()
	x86/smpboot: Remove stale TLB flush invocations
	n_tty: fix EXTPROC vs ICANON interaction with TIOCINQ (aka FIONREAD)
	tty: fix tty_ldisc_receive_buf() documentation
	mm/vmstat: Make NR_TLB_REMOTE_FLUSH_RECEIVED available even on UP
	Linux 4.9.74

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2018-01-02 20:45:15 +01:00
Greg Kroah-Hartman
07bcb2489b Linux 4.9.74 2018-01-02 20:35:18 +01:00
Andy Lutomirski
181a832c2e mm/vmstat: Make NR_TLB_REMOTE_FLUSH_RECEIVED available even on UP
commit 5dd0b16cda upstream.

This fixes CONFIG_SMP=n, CONFIG_DEBUG_TLBFLUSH=y without introducing
further #ifdef soup.  Caught by a Kbuild bot randconfig build.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: ce4a4e565f ("x86/mm: Remove the UP asm/tlbflush.h code, always use the (formerly) SMP code")
Link: http://lkml.kernel.org/r/76da9a3cc4415996f2ad2c905b93414add322021.1496673616.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-02 20:35:18 +01:00
Johan Hovold
d76dabb5af tty: fix tty_ldisc_receive_buf() documentation
commit e7e51dcf3b upstream.

The tty_ldisc_receive_buf() helper returns the number of bytes
processed so drop the bogus "not" from the kernel doc comment.

Fixes: 8d082cd300 ("tty: Unify receive_buf() code paths")
Signed-off-by: Johan Hovold <johan@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-02 20:35:17 +01:00