Commit Graph

796629 Commits

Author SHA1 Message Date
Dennis Wassenberg
22454b79e6 usb: core: Fix hub port connection events lost
This will clear the USB_PORT_FEAT_C_CONNECTION bit in case of a hub port reset
only if a device is was attached to the hub port before resetting the hub port.

Using a Lenovo T480s attached to the ultra dock it was not possible to detect
some usb-c devices at the dock usb-c ports because the hub_port_reset code
will clear the USB_PORT_FEAT_C_CONNECTION bit after the actual hub port reset.
Using this device combo the USB_PORT_FEAT_C_CONNECTION bit was set between the
actual hub port reset and the clear of the USB_PORT_FEAT_C_CONNECTION bit.
This ends up with clearing the USB_PORT_FEAT_C_CONNECTION bit after the
new device was attached such that it was not detected.

This patch will not clear the USB_PORT_FEAT_C_CONNECTION bit if there is
currently no device attached to the port before the hub port reset.
This will avoid clearing the connection bit for new attached devices.

Signed-off-by: Dennis Wassenberg <dennis.wassenberg@secunet.com>
Acked-by: Mathias Nyman <mathias.nyman@linux.intel.com>
Cc: stable <stable@vger.kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-11-14 14:28:26 -08:00
Greg Kroah-Hartman
4b8440abc9 Merge tag 'fixes-for-v4.20-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/balbi/usb into usb-linus
Felipe writes:

For now only 5 small fixes. Most importantly, we have a fix for the TRB
type used on unaligned transfers on dwc3. Also a fix for a NULL pointer
dereference in dwc3_pci_remove(). Note that a recent commit on ffs was
reverted because it causes a regression elsewere.

* tag 'fixes-for-v4.20-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/balbi/usb:
  usb: dwc3: gadget: fix ISOC TRB type on unaligned transfers
  Revert "usb: gadget: ffs: Fix BUG when userland exits with submitted AIO transfers"
  usb: dwc2: pci: Fix an error code in probe
  usb: dwc3: Fix NULL pointer exception in dwc3_pci_remove()
  usb: dwc3: gadget: Properly check last unaligned/zero chain TRB
  usb: dwc3: core: Clean up ULPI device
2018-11-14 14:24:07 -08:00
Felipe Balbi
2fc6d4be35 usb: dwc3: gadget: fix ISOC TRB type on unaligned transfers
When chaining ISOC TRBs together, only the first ISOC TRB should be of
type ISOC_FIRST, all others should be of type ISOC. This patch fixes
that.

Fixes: c6267a5163 ("usb: dwc3: gadget: align transfers to wMaxPacketSize")
Cc: <stable@vger.kernel.org> # v4.11+
Signed-off-by: Felipe Balbi <felipe.balbi@linux.intel.com>
2018-11-14 12:43:52 +02:00
Shen Jing
a9c859033f Revert "usb: gadget: ffs: Fix BUG when userland exits with submitted AIO transfers"
This reverts commit b4194da3f9087dd38d91b40f9bec42d59ce589a8
since it causes list corruption followed by kernel panic:

Workqueue: adb ffs_aio_cancel_worker
RIP: 0010:__list_add_valid+0x4d/0x70
Call Trace:
insert_work+0x47/0xb0
__queue_work+0xf6/0x400
queue_work_on+0x65/0x70
dwc3_gadget_giveback+0x44/0x50 [dwc3]
dwc3_gadget_ep_dequeue+0x83/0x2d0 [dwc3]
? finish_wait+0x80/0x80
usb_ep_dequeue+0x1e/0x90
process_one_work+0x18c/0x3b0
worker_thread+0x3c/0x390
? process_one_work+0x3b0/0x3b0
kthread+0x11e/0x140
? kthread_create_worker_on_cpu+0x70/0x70
ret_from_fork+0x3a/0x50

This issue is seen with warm reboot stability testing.

Signed-off-by: Shen Jing <jingx.shen@intel.com>
Signed-off-by: Saranya Gopal <saranya.gopal@intel.com>
Signed-off-by: Felipe Balbi <felipe.balbi@linux.intel.com>
2018-11-14 11:15:13 +02:00
Dan Carpenter
3c135e8900 usb: dwc2: pci: Fix an error code in probe
We added some error handling to this function but forgot to set the
error code on this path.

Fixes: ecd29dabb2 ("usb: dwc2: pci: Handle error cleanup in probe")
Acked-by: Minas Harutyunyan <hminas@synopsys.com>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Felipe Balbi <felipe.balbi@linux.intel.com>
2018-11-14 11:07:12 +02:00
Kuppuswamy Sathyanarayanan
7b412b04a0 usb: dwc3: Fix NULL pointer exception in dwc3_pci_remove()
In dwc3_pci_quirks() function, gpiod lookup table is only registered for
baytrail SOC. But in dwc3_pci_remove(), we try to unregistered it
without any checks. This leads to NULL pointer de-reference exception in
gpiod_remove_lookup_table() when unloading the module for non baytrail
SOCs. This patch fixes this issue.

Fixes: 5741022cbd ("usb: dwc3: pci: Add GPIO lookup table on platforms
without ACPI GPIO resources")
Cc: <stable@vger.kernel.org>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Reviewed-by: Heikki Krogerus <heikki.krogerus@linux.intel.com>
Signed-off-by: Felipe Balbi <felipe.balbi@linux.intel.com>
2018-11-14 10:37:19 +02:00
Cherian, George
11644a7659 xhci: Add quirk to workaround the errata seen on Cavium Thunder-X2 Soc
Implement workaround for ThunderX2 Errata-129 (documented in
CN99XX Known Issues" available at Cavium support site).
As per ThunderX2errata-129, USB 2 device may come up as USB 1
if a connection to a USB 1 device is followed by another connection to
a USB 2 device, the link will come up as USB 1 for the USB 2 device.

Resolution: Reset the PHY after the USB 1 device is disconnected.
The PHY reset sequence is done using private registers in XHCI register
space. After the PHY is reset we check for the PLL lock status and retry
the operation if it fails. From our tests, retrying 4 times is sufficient.

Add a new quirk flag XHCI_RESET_PLL_ON_DISCONNECT to invoke the workaround
in handle_xhci_port_status().

Cc: stable@vger.kernel.org
Signed-off-by: George Cherian <george.cherian@cavium.com>
Signed-off-by: Mathias Nyman <mathias.nyman@linux.intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-11-09 08:31:09 -08:00
Aaron Ma
a5baeaeabc usb: xhci: fix timeout for transition from RExit to U0
This definition is used by msecs_to_jiffies in milliseconds.
According to the comments, max rexit timeout should be 20ms.
Align with the comments to properly calculate the delay.

Verified on Sunrise Point-LP and Cannon Lake.

Cc: stable@vger.kernel.org
Signed-off-by: Aaron Ma <aaron.ma@canonical.com>
Signed-off-by: Mathias Nyman <mathias.nyman@linux.intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-11-09 08:31:09 -08:00
Aaron Ma
958c0bd860 usb: xhci: fix uninitialized completion when USB3 port got wrong status
Realtek USB3.0 Card Reader [0bda:0328] reports wrong port status on
Cannon lake PCH USB3.1 xHCI [8086:a36d] after resume from S3,
after clear port reset it works fine.

Since this device is registered on USB3 roothub at boot,
when port status reports not superspeed, xhci_get_port_status will call
an uninitialized completion in bus_state[0].
Kernel will hang because of NULL pointer.

Restrict the USB2 resume status check in USB2 roothub to fix hang issue.

Cc: stable@vger.kernel.org
Signed-off-by: Aaron Ma <aaron.ma@canonical.com>
Signed-off-by: Mathias Nyman <mathias.nyman@linux.intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-11-09 08:31:09 -08:00
Sandeep Singh
d9193efba8 xhci: Add check for invalid byte size error when UAS devices are connected.
Observed "TRB completion code (27)" error which corresponds to Stopped -
Length Invalid error(xhci spec section 4.17.4) while connecting USB to
SATA bridge.

Looks like this case was not considered when the following patch[1] was
committed. Hence adding this new check which can prevent
the invalid byte size error.

[1] ade2e3a xhci: handle transfer events without TRB pointer

Cc: <stable@vger.kernel.org>
Signed-off-by: Sandeep Singh <sandeep.singh@amd.com>
cc: Nehal Shah <Nehal-bakulchandra.Shah@amd.com>
cc: Shyam Sundar S K <Shyam-sundar.S-k@amd.com>
Signed-off-by: Mathias Nyman <mathias.nyman@linux.intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-11-09 08:31:08 -08:00
Mathias Nyman
1245374e9b xhci: handle port status events for removed USB3 hcd
At xhci removal the USB3 hcd (shared_hcd) is removed before the primary
USB2 hcd. Interrupts for port status changes may still occur for USB3
ports after the shared_hcd is freed, causing  NULL pointer dereference.

Check if xhci->shared_hcd is still valid before handing USB3 port events

Cc: <stable@vger.kernel.org>
Reported-by: Peter Chen <peter.chen@nxp.com>
Tested-by: Jack Pham <jackp@codeaurora.org>
Signed-off-by: Mathias Nyman <mathias.nyman@linux.intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-11-09 08:31:08 -08:00
Mathias Nyman
f068090426 xhci: Fix leaking USB3 shared_hcd at xhci removal
Ensure that the shared_hcd pointer is valid when calling usb_put_hcd()

The shared_hcd is removed and freed in xhci by first calling
usb_remove_hcd(xhci->shared_hcd), and later
usb_put_hcd(xhci->shared_hcd)

Afer commit fe190ed0d6 ("xhci: Do not halt the host until both HCD have
disconnected their devices.") the shared_hcd was never properly put as
xhci->shared_hcd was set to NULL before usb_put_hcd(xhci->shared_hcd) was
called.

shared_hcd (USB3) is removed before primary hcd (USB2).
While removing the primary hcd we might need to handle xhci interrupts
to cleanly remove last USB2 devices, therefore we need to set
xhci->shared_hcd to NULL before removing the primary hcd to let xhci
interrupt handler know shared_hcd is no longer available.

xhci-plat.c, xhci-histb.c and xhci-mtk first create both their hcd's before
adding them. so to keep the correct reverse removal order use a temporary
shared_hcd variable for them.
For more details see commit 4ac53087d6 ("usb: xhci: plat: Create both
HCDs before adding them")

Fixes: fe190ed0d6 ("xhci: Do not halt the host until both HCD have disconnected their devices.")
Cc: Joel Stanley <joel@jms.id.au>
Cc: Chunfeng Yun <chunfeng.yun@mediatek.com>
Cc: Thierry Reding <treding@nvidia.com>
Cc: Jianguo Sun <sunjianguo1@huawei.com>
Cc: <stable@vger.kernel.org>
Reported-by: Jack Pham <jackp@codeaurora.org>
Tested-by: Jack Pham <jackp@codeaurora.org>
Tested-by: Peter Chen <peter.chen@nxp.com>
Signed-off-by: Mathias Nyman <mathias.nyman@linux.intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-11-09 08:31:08 -08:00
Mattias Jacobsson
f6501f4919 USB: misc: appledisplay: add 20" Apple Cinema Display
Add another Apple Cinema Display to the list of supported displays

Signed-off-by: Mattias Jacobsson <2pi@mok.nu>
Cc: stable <stable@vger.kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-11-07 13:23:18 +01:00
Kai-Heng Feng
deefd24228 USB: quirks: Add no-lpm quirk for Raydium touchscreens
Raydium USB touchscreen fails to set config if LPM is enabled:
[    2.030658] usb 1-8: New USB device found, idVendor=2386, idProduct=3119
[    2.030659] usb 1-8: New USB device strings: Mfr=1, Product=2, SerialNumber=0
[    2.030660] usb 1-8: Product: Raydium Touch System
[    2.030661] usb 1-8: Manufacturer: Raydium Corporation
[    7.132209] usb 1-8: can't set config #1, error -110

Same behavior can be observed on 2386:3114.

Raydium claims the touchscreen supports LPM under Windows, so I used
Microsoft USB Test Tools (MUTT) [1] to check its LPM status. MUTT shows
that the LPM doesn't work under Windows, either. So let's just disable LPM
for Raydium touchscreens.

[1] https://docs.microsoft.com/en-us/windows-hardware/drivers/usbcon/usb-test-tools

Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
Cc: stable <stable@vger.kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-11-07 13:23:18 +01:00
Emmanuel Pescosta
a771125776 usb: quirks: Add delay-init quirk for Corsair K70 LUX RGB
Following on from this patch: https://lkml.org/lkml/2017/11/3/516,
Corsair K70 LUX RGB keyboards also require the DELAY_INIT quirk to
start correctly at boot.

Dmesg output:
usb 1-6: string descriptor 0 read error: -110
usb 1-6: New USB device found, idVendor=1b1c, idProduct=1b33
usb 1-6: New USB device strings: Mfr=1, Product=2, SerialNumber=3
usb 1-6: can't set config #1, error -110

Signed-off-by: Emmanuel Pescosta <emmanuelpescosta099@gmail.com>
Cc: stable <stable@vger.kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-11-07 13:23:18 +01:00
Kai-Heng Feng
781f0766cc USB: Wait for extra delay time after USB_PORT_FEAT_RESET for quirky hub
Devices connected under Terminus Technology Inc. Hub (1a40:0101) may
fail to work after the system resumes from suspend:
[  206.063325] usb 3-2.4: reset full-speed USB device number 4 using xhci_hcd
[  206.143691] usb 3-2.4: device descriptor read/64, error -32
[  206.351671] usb 3-2.4: device descriptor read/64, error -32

Info for this hub:
T:  Bus=03 Lev=01 Prnt=01 Port=01 Cnt=01 Dev#=  2 Spd=480 MxCh= 4
D:  Ver= 2.00 Cls=09(hub  ) Sub=00 Prot=01 MxPS=64 #Cfgs=  1
P:  Vendor=1a40 ProdID=0101 Rev=01.11
S:  Product=USB 2.0 Hub
C:  #Ifs= 1 Cfg#= 1 Atr=e0 MxPwr=100mA
I:  If#= 0 Alt= 0 #EPs= 1 Cls=09(hub  ) Sub=00 Prot=00 Driver=hub

Some expirements indicate that the USB devices connected to the hub are
innocent, it's the hub itself is to blame. The hub needs extra delay
time after it resets its port.

Hence wait for extra delay, if the device is connected to this quirky
hub.

Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
Cc: stable <stable@vger.kernel.org>
Acked-by: Alan Stern <stern@rowland.harvard.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-11-07 13:23:18 +01:00
Thinh Nguyen
ba3a51ac32 usb: dwc3: gadget: Properly check last unaligned/zero chain TRB
Current check for the last extra TRB for zero and unaligned transfers
does not account for isoc OUT. The last TRB of the Buffer Descriptor for
isoc OUT transfers will be retired with HWO=0. As a result, we won't
return early. The req->remaining will be updated to include the BUFSIZ
count of the extra TRB, and the actual number of transferred bytes
calculation will be wrong.

To fix this, check whether it's a short or zero packet and the last TRB
chain bit to return early.

Fixes: c6267a5163 ("usb: dwc3: gadget: align transfers to wMaxPacketSize")
Cc: <stable@vger.kernel.org>
Signed-off-by: Thinh Nguyen <thinhn@synopsys.com>
Signed-off-by: Felipe Balbi <felipe.balbi@linux.intel.com>
2018-11-06 08:40:47 +02:00
Andy Shevchenko
08fd9a82fd usb: dwc3: core: Clean up ULPI device
If dwc3_core_init_mode() fails with deferred probe,
next probe fails on sysfs with

sysfs: cannot create duplicate filename '/devices/pci0000:00/0000:00:11.0/dwc3.0.auto/dwc3.0.auto.ulpi'

To avoid this failure, clean up ULPI device.

Cc: <stable@vger.kernel.org>
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Felipe Balbi <felipe.balbi@linux.intel.com>
2018-11-06 08:40:47 +02:00
Linus Torvalds
651022382c Linux 4.20-rc1 2018-11-04 15:37:52 -08:00
Linus Torvalds
42bd06e93d Merge tag 'tags/upstream-4.20-rc1' of git://git.infradead.org/linux-ubifs
Pull UBIFS updates from Richard Weinberger:

 - Full filesystem authentication feature, UBIFS is now able to have the
   whole filesystem structure authenticated plus user data encrypted and
   authenticated.

 - Minor cleanups

* tag 'tags/upstream-4.20-rc1' of git://git.infradead.org/linux-ubifs: (26 commits)
  ubifs: Remove unneeded semicolon
  Documentation: ubifs: Add authentication whitepaper
  ubifs: Enable authentication support
  ubifs: Do not update inode size in-place in authenticated mode
  ubifs: Add hashes and HMACs to default filesystem
  ubifs: authentication: Authenticate super block node
  ubifs: Create hash for default LPT
  ubfis: authentication: Authenticate master node
  ubifs: authentication: Authenticate LPT
  ubifs: Authenticate replayed journal
  ubifs: Add auth nodes to garbage collector journal head
  ubifs: Add authentication nodes to journal
  ubifs: authentication: Add hashes to index nodes
  ubifs: Add hashes to the tree node cache
  ubifs: Create functions to embed a HMAC in a node
  ubifs: Add helper functions for authentication support
  ubifs: Add separate functions to init/crc a node
  ubifs: Format changes for authentication support
  ubifs: Store read superblock node
  ubifs: Drop write_node
  ...
2018-11-04 14:46:04 -08:00
Linus Torvalds
4710e78940 Merge tag 'nfs-for-4.20-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client bugfixes from Trond Myklebust:
 "Highlights include:

  Bugfix:
   - Fix build issues on architectures that don't provide 64-bit cmpxchg

  Cleanups:
   - Fix a spelling mistake"

* tag 'nfs-for-4.20-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
  NFS: fix spelling mistake, EACCESS -> EACCES
  SUNRPC: Use atomic(64)_t for seq_send(64)
2018-11-04 08:20:09 -08:00
Linus Torvalds
35e7452442 Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull more timer updates from Thomas Gleixner:
 "A set of commits for the new C-SKY architecture timers"

* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  dt-bindings: timer: gx6605s SOC timer
  clocksource/drivers/c-sky: Add gx6605s SOC system timer
  dt-bindings: timer: C-SKY Multi-processor timer
  clocksource/drivers/c-sky: Add C-SKY SMP timer
2018-11-04 08:15:15 -08:00
Linus Torvalds
04578e8441 Merge tag 'ntb-4.20' of git://github.com/jonmason/ntb
Pull NTB updates from Jon Mason:
 "Fairly minor changes and bug fixes:

  NTB IDT thermal changes and hook into hwmon, ntb_netdev clean-up of
  private struct, and a few bug fixes"

* tag 'ntb-4.20' of git://github.com/jonmason/ntb:
  ntb: idt: Alter the driver info comments
  ntb: idt: Discard temperature sensor IRQ handler
  ntb: idt: Add basic hwmon sysfs interface
  ntb: idt: Alter temperature read method
  ntb_netdev: Simplify remove with client device drvdata
  NTB: transport: Try harder to alloc an aligned MW buffer
  ntb: ntb_transport: Mark expected switch fall-throughs
  ntb: idt: Set PCIe bus address to BARLIMITx
  NTB: ntb_hw_idt: replace IS_ERR_OR_NULL with regular NULL checks
  ntb: intel: fix return value for ndev_vec_mask()
  ntb_netdev: fix sleep time mismatch
2018-11-04 08:12:44 -08:00
Linus Torvalds
71e5602817 Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Ingo Molnar:
 "A memory (under-)allocation fix and a comment fix"

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/topology: Fix off by one bug
  sched/rt: Update comment in pick_next_task_rt()
2018-11-03 18:37:09 -07:00
Linus Torvalds
601a88077c Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Ingo Molnar:
 "A number of fixes and some late updates:

   - make in_compat_syscall() behavior on x86-32 similar to other
     platforms, this touches a number of generic files but is not
     intended to impact non-x86 platforms.

   - objtool fixes

   - PAT preemption fix

   - paravirt fixes/cleanups

   - cpufeatures updates for new instructions

   - earlyprintk quirk

   - make microcode version in sysfs world-readable (it is already
     world-readable in procfs)

   - minor cleanups and fixes"

* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  compat: Cleanup in_compat_syscall() callers
  x86/compat: Adjust in_compat_syscall() to generic code under !COMPAT
  objtool: Support GCC 9 cold subfunction naming scheme
  x86/numa_emulation: Fix uniform-split numa emulation
  x86/paravirt: Remove unused _paravirt_ident_32
  x86/mm/pat: Disable preemption around __flush_tlb_all()
  x86/paravirt: Remove GPL from pv_ops export
  x86/traps: Use format string with panic() call
  x86: Clean up 'sizeof x' => 'sizeof(x)'
  x86/cpufeatures: Enumerate MOVDIR64B instruction
  x86/cpufeatures: Enumerate MOVDIRI instruction
  x86/earlyprintk: Add a force option for pciserial device
  objtool: Support per-function rodata sections
  x86/microcode: Make revision and processor flags world-readable
2018-11-03 18:25:17 -07:00
Linus Torvalds
01897f3e05 Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf updates and fixes from Ingo Molnar:
 "These are almost all tooling updates: 'perf top', 'perf trace' and
  'perf script' fixes and updates, an UAPI header sync with the merge
  window versions, license marker updates, much improved Sparc support
  from David Miller, and a number of fixes"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (66 commits)
  perf intel-pt/bts: Calculate cpumode for synthesized samples
  perf intel-pt: Insert callchain context into synthesized callchains
  perf tools: Don't clone maps from parent when synthesizing forks
  perf top: Start display thread earlier
  tools headers uapi: Update linux/if_link.h header copy
  tools headers uapi: Update linux/netlink.h header copy
  tools headers: Sync the various kvm.h header copies
  tools include uapi: Update linux/mmap.h copy
  perf trace beauty: Use the mmap flags table generated from headers
  perf beauty: Wire up the mmap flags table generator to the Makefile
  perf beauty: Add a generator for MAP_ mmap's flag constants
  tools include uapi: Update asound.h copy
  tools arch uapi: Update asm-generic/unistd.h and arm64 unistd.h copies
  tools include uapi: Update linux/fs.h copy
  perf callchain: Honour the ordering of PERF_CONTEXT_{USER,KERNEL,etc}
  perf cs-etm: Correct CPU mode for samples
  perf unwind: Take pgoff into account when reporting elf to libdwfl
  perf top: Do not use overwrite mode by default
  perf top: Allow disabling the overwrite mode
  perf trace: Beautify mount's first pathname arg
  ...
2018-11-03 18:13:43 -07:00
Linus Torvalds
e9ebc2151f Merge branch 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq fixes from Ingo Molnar:
 "An irqchip driver fix and a memory (over-)allocation fix"

* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  irqchip/irq-mvebu-sei: Fix a NULL vs IS_ERR() bug in probe function
  irq/matrix: Fix memory overallocation
2018-11-03 18:12:09 -07:00
Peter Zijlstra
993f0b0510 sched/topology: Fix off by one bug
With the addition of the NUMA identity level, we increased @level by
one and will run off the end of the array in the distance sort loop.

Fixed: 051f3ca02e ("sched/topology: Introduce NUMA identity node sched domain")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-11-04 00:40:03 +01:00
Ingo Molnar
23a12ddee1 Merge branch 'core/urgent' into x86/urgent, to pick up objtool fix
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-11-03 23:42:16 +01:00
Linus Torvalds
d2ff0ff2c2 Merge tag 'armsoc-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc
Pull ARM SoC fixes from Olof Johansson:
 "A few fixes who have come in near or during the merge window:

   - Removal of a VLA usage in Marvell mpp platform code

   - Enable some IPMI options for ARM64 servers by default, helps
     testing

   - Enable PREEMPT on 32-bit ARMv7 defconfig

   - Minor fix for stm32 DT (removal of an unused DMA property)

   - Bugfix for TI OMAP1-based ams-delta (-EINVAL -> IRQ_NOTCONNECTED)"

* tag 'armsoc-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc:
  ARM: dts: stm32: update HASH1 dmas property on stm32mp157c
  ARM: orion: avoid VLA in orion_mpp_conf
  ARM: defconfig: Update multi_v7 to use PREEMPT
  arm64: defconfig: Enable some IPMI configs
  soc: ti: QMSS: Fix usage of irq_set_affinity_hint
  ARM: OMAP1: ams-delta: Fix impossible .irq < 0
2018-11-03 12:13:57 -07:00
Linus Torvalds
83650fd58a Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull more arm64 updates from Catalin Marinas:

 - fix W+X page (mark RO) allocated by the arm64 kprobes code

 - Makefile fix for .i files in out of tree modules

* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
  arm64: kprobe: make page to RO mode when allocate it
  arm64: kdump: fix small typo
  arm64: makefile fix build of .i file in external module case
2018-11-03 10:55:23 -07:00
Linus Torvalds
3308a383ce Merge tag 'dma-mapping-4.20-2' of git://git.infradead.org/users/hch/dma-mapping
Pull dma-mapping fix from Christoph Hellwig:
 "Avoid compile warnings on non-default arm64 configs"

* tag 'dma-mapping-4.20-2' of git://git.infradead.org/users/hch/dma-mapping:
  arm64: fix warnings without CONFIG_IOMMU_DMA
2018-11-03 10:53:33 -07:00
Linus Torvalds
9a12efc5e0 Merge tag 'kbuild-v4.20-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild
Pull Kbuild updates from Masahiro Yamada:

 - clean-up leftovers in Kconfig files

 - remove stale oldnoconfig and silentoldconfig targets

 - remove unneeded cc-fullversion and cc-name variables

 - improve merge_config script to allow overriding option prefix

* tag 'kbuild-v4.20-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
  kbuild: remove cc-name variable
  kbuild: replace cc-name test with CONFIG_CC_IS_CLANG
  merge_config.sh: Allow to define config prefix
  kbuild: remove unused cc-fullversion variable
  kconfig: remove silentoldconfig target
  kconfig: remove oldnoconfig target
  powerpc: PCI_MSI needs PCI
  powerpc: remove CONFIG_MCA leftovers
  powerpc: remove CONFIG_PCI_QSPAN
  scsi: aha152x: rename the PCMCIA define
2018-11-03 10:47:33 -07:00
Linus Torvalds
169447287b Merge tag '4.20-rc1-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6
Pull cifs fixes and updates from Steve French:
 "Three small fixes (one Kerberos related, one for stable, and another
  fixes an oops in xfstest 377), two helpful debugging improvements,
  three patches for cifs directio and some minor cleanup"

* tag '4.20-rc1-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6:
  cifs: fix signed/unsigned mismatch on aio_read patch
  cifs: don't dereference smb_file_target before null check
  CIFS: Add direct I/O functions to file_operations
  CIFS: Add support for direct I/O write
  CIFS: Add support for direct I/O read
  smb3: missing defines and structs for reparse point handling
  smb3: allow more detailed protocol info on open files for debugging
  smb3: on kerberos mount if server doesn't specify auth type use krb5
  smb3: add trace point for tree connection
  cifs: fix spelling mistake, EACCESS -> EACCES
  cifs: fix return value for cifs_listxattr
2018-11-03 10:45:55 -07:00
Linus Torvalds
ed61a132cb Merge branch 'work.afs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull 9p fix from Al Viro:
 "Regression fix for net/9p handling of iov_iter; broken by braino when
  switching to iov_iter_is_kvec() et.al., spotted and fixed by Marc"

* 'work.afs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  iov_iter: Fix 9p virtio breakage
2018-11-03 10:35:52 -07:00
Linus Torvalds
af102b333a Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull more SCSI updates from James Bottomley:
 "This is a set of minor small (and safe changes) that didn't make the
  initial pull request plus some bug fixes"

* tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
  scsi: mvsas: Remove set but not used variable 'id'
  scsi: qla2xxx: Remove two arguments from qlafx00_error_entry()
  scsi: qla2xxx: Make sure that qlafx00_ioctl_iosb_entry() initializes 'res'
  scsi: qla2xxx: Remove a set-but-not-used variable
  scsi: qla2xxx: Make qla2x00_sysfs_write_nvram() easier to analyze
  scsi: qla2xxx: Declare local functions 'static'
  scsi: qla2xxx: Improve several kernel-doc headers
  scsi: qla2xxx: Modify fall-through annotations
  scsi: 3w-sas: 3w-9xxx: Use unsigned char for cdb
  scsi: mvsas: Use dma_pool_zalloc
  scsi: target: Don't request modules that aren't even built
  scsi: target: Set response length for REPORT TARGET PORT GROUPS
2018-11-03 10:34:03 -07:00
Linus Torvalds
cddfa11aef Merge branch 'akpm' (patches from Andrew)
Merge more updates from Andrew Morton:

 - more ocfs2 work

 - various leftovers

* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
  memory_hotplug: cond_resched in __remove_pages
  bfs: add sanity check at bfs_fill_super()
  kernel/sysctl.c: remove duplicated include
  kernel/kexec_file.c: remove some duplicated includes
  mm, thp: consolidate THP gfp handling into alloc_hugepage_direct_gfpmask
  ocfs2: fix clusters leak in ocfs2_defrag_extent()
  ocfs2: dlmglue: clean up timestamp handling
  ocfs2: don't put and assigning null to bh allocated outside
  ocfs2: fix a misuse a of brelse after failing ocfs2_check_dir_entry
  ocfs2: don't use iocb when EIOCBQUEUED returns
  ocfs2: without quota support, avoid calling quota recovery
  ocfs2: remove ocfs2_is_o2cb_active()
  mm: thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings
  include/linux/notifier.h: SRCU: fix ctags
  mm: handle no memcg case in memcg_kmem_charge() properly
2018-11-03 10:21:43 -07:00
Michal Hocko
dd33ad7b25 memory_hotplug: cond_resched in __remove_pages
We have received a bug report that unbinding a large pmem (>1TB) can
result in a soft lockup:

  NMI watchdog: BUG: soft lockup - CPU#9 stuck for 23s! [ndctl:4365]
  [...]
  Supported: Yes
  CPU: 9 PID: 4365 Comm: ndctl Not tainted 4.12.14-94.40-default #1 SLE12-SP4
  Hardware name: Intel Corporation S2600WFD/S2600WFD, BIOS SE5C620.86B.01.00.0833.051120182255 05/11/2018
  task: ffff9cce7d4410c0 task.stack: ffffbe9eb1bc4000
  RIP: 0010:__put_page+0x62/0x80
  Call Trace:
   devm_memremap_pages_release+0x152/0x260
   release_nodes+0x18d/0x1d0
   device_release_driver_internal+0x160/0x210
   unbind_store+0xb3/0xe0
   kernfs_fop_write+0x102/0x180
   __vfs_write+0x26/0x150
   vfs_write+0xad/0x1a0
   SyS_write+0x42/0x90
   do_syscall_64+0x74/0x150
   entry_SYSCALL_64_after_hwframe+0x3d/0xa2
  RIP: 0033:0x7fd13166b3d0

It has been reported on an older (4.12) kernel but the current upstream
code doesn't cond_resched in the hot remove code at all and the given
range to remove might be really large.  Fix the issue by calling
cond_resched once per memory section.

Link: http://lkml.kernel.org/r/20181031125840.23982-1-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Johannes Thumshirn <jthumshirn@suse.de>
Cc: Dan Williams <dan.j.williams@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-11-03 10:09:38 -07:00
Tetsuo Handa
9f2df09a33 bfs: add sanity check at bfs_fill_super()
syzbot is reporting too large memory allocation at bfs_fill_super() [1].
Since file system image is corrupted such that bfs_sb->s_start == 0,
bfs_fill_super() is trying to allocate 8MB of continuous memory. Fix
this by adding a sanity check on bfs_sb->s_start, __GFP_NOWARN and
printf().

[1] https://syzkaller.appspot.com/bug?id=16a87c236b951351374a84c8a32f40edbc034e96

Link: http://lkml.kernel.org/r/1525862104-3407-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reported-by: syzbot <syzbot+71c6b5d68e91149fc8a4@syzkaller.appspotmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Tigran Aivazian <aivazian.tigran@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-11-03 10:09:38 -07:00
Michael Schupikov
6f0483d1f9 kernel/sysctl.c: remove duplicated include
Remove one include of <linux/pipe_fs_i.h>.
No functional changes.

Link: http://lkml.kernel.org/r/20181004134223.17735-1-michael@schupikov.de
Signed-off-by: Michael Schupikov <michael@schupikov.de>
Reviewed-by: Richard Weinberger <richard@nod.at>
Acked-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-11-03 10:09:37 -07:00
zhong jiang
3383b36040 kernel/kexec_file.c: remove some duplicated includes
We include kexec.h and slab.h twice in kexec_file.c. It's unnecessary.
hence just remove them.

Link: http://lkml.kernel.org/r/1537498098-19171-1-git-send-email-zhongjiang@huawei.com
Signed-off-by: zhong jiang <zhongjiang@huawei.com>
Reviewed-by: Bhupesh Sharma <bhsharma@redhat.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Baoquan He <bhe@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-11-03 10:09:37 -07:00
Michal Hocko
89c83fb539 mm, thp: consolidate THP gfp handling into alloc_hugepage_direct_gfpmask
THP allocation mode is quite complex and it depends on the defrag mode.
This complexity is hidden in alloc_hugepage_direct_gfpmask from a large
part currently. The NUMA special casing (namely __GFP_THISNODE) is
however independent and placed in alloc_pages_vma currently. This both
adds an unnecessary branch to all vma based page allocation requests and
it makes the code more complex unnecessarily as well. Not to mention
that e.g. shmem THP used to do the node reclaiming unconditionally
regardless of the defrag mode until recently. This was not only
unexpected behavior but it was also hardly a good default behavior and I
strongly suspect it was just a side effect of the code sharing more than
a deliberate decision which suggests that such a layering is wrong.

Get rid of the thp special casing from alloc_pages_vma and move the
logic to alloc_hugepage_direct_gfpmask. __GFP_THISNODE is applied to the
resulting gfp mask only when the direct reclaim is not requested and
when there is no explicit numa binding to preserve the current logic.

Please note that there's also a slight difference wrt MPOL_BIND now. The
previous code would avoid using __GFP_THISNODE if the local node was
outside of policy_nodemask(). After this patch __GFP_THISNODE is avoided
for all MPOL_BIND policies. So there's a difference that if local node
is actually allowed by the bind policy's nodemask, previously
__GFP_THISNODE would be added, but now it won't be. From the behavior
POV this is still correct because the policy nodemask is used.

Link: http://lkml.kernel.org/r/20180925120326.24392-3-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Stefan Priebe - Profihost AG <s.priebe@profihost.ag>
Cc: Zi Yan <zi.yan@cs.rutgers.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-11-03 10:09:37 -07:00
Larry Chen
6194ae4242 ocfs2: fix clusters leak in ocfs2_defrag_extent()
ocfs2_defrag_extent() might leak allocated clusters.  When the file
system has insufficient space, the number of claimed clusters might be
less than the caller wants.  If that happens, the original code might
directly commit the transaction without returning clusters.

This patch is based on code in ocfs2_add_clusters_in_btree().

[akpm@linux-foundation.org: include localalloc.h, reduce scope of data_ac]
Link: http://lkml.kernel.org/r/20180904041621.16874-3-lchen@suse.com
Signed-off-by: Larry Chen <lchen@suse.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <jiangqi903@gmail.com>
Cc: Changwei Ge <ge.changwei@h3c.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-11-03 10:09:37 -07:00
Arnd Bergmann
3a3d1e5104 ocfs2: dlmglue: clean up timestamp handling
The handling of timestamps outside of the 1970..2038 range in the dlm
glue is rather inconsistent: on 32-bit architectures, this has always
wrapped around to negative timestamps in the 1902..1969 range, while on
64-bit kernels all timestamps are interpreted as positive 34 bit numbers
in the 1970..2514 year range.

Now that the VFS code handles 64-bit timestamps on all architectures, we
can make the behavior more consistent here, and return the same result
that we had on 64-bit already, making the file system y2038 safe in the
process.  Outside of dlmglue, it already uses 64-bit on-disk timestamps
anway, so that part is fine.

For consistency, I'm changing ocfs2_pack_timespec() to clamp anything
outside of the supported range to the minimum and maximum values.  This
avoids a possible ambiguity of values before 1970 in particular, which
used to be interpreted as times at the end of the 2514 range previously.

Link: http://lkml.kernel.org/r/20180619155826.4106487-1-arnd@arndb.de
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <jiangqi903@gmail.com>
Cc: Changwei Ge <ge.changwei@h3c.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-11-03 10:09:37 -07:00
Changwei Ge
cf76c78595 ocfs2: don't put and assigning null to bh allocated outside
ocfs2_read_blocks() and ocfs2_read_blocks_sync() are both used to read
several blocks from disk.  Currently, the input argument *bhs* can be
NULL or NOT.  It depends on the caller's behavior.  If the function
fails in reading blocks from disk, the corresponding bh will be assigned
to NULL and put.

Obviously, above process for non-NULL input bh is not appropriate.
Because the caller doesn't even know its bhs are put and re-assigned.

If buffer head is managed by caller, ocfs2_read_blocks and
ocfs2_read_blocks_sync() should not evaluate it to NULL.  It will cause
caller accessing illegal memory, thus crash.

Link: http://lkml.kernel.org/r/HK2PR06MB045285E0F4FBB561F9F2F9B3D5680@HK2PR06MB0452.apcprd06.prod.outlook.com
Signed-off-by: Changwei Ge <ge.changwei@h3c.com>
Reviewed-by: Guozhonghua <guozhonghua@h3c.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <jiangqi903@gmail.com>
Cc: Changwei Ge <ge.changwei@h3c.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-11-03 10:09:37 -07:00
Changwei Ge
29aa30167a ocfs2: fix a misuse a of brelse after failing ocfs2_check_dir_entry
Somehow, file system metadata was corrupted, which causes
ocfs2_check_dir_entry() to fail in function ocfs2_dir_foreach_blk_el().

According to the original design intention, if above happens we should
skip the problematic block and continue to retrieve dir entry.  But
there is obviouse misuse of brelse around related code.

After failure of ocfs2_check_dir_entry(), current code just moves to
next position and uses the problematic buffer head again and again
during which the problematic buffer head is released for multiple times.
I suppose, this a serious issue which is long-lived in ocfs2.  This may
cause other file systems which is also used in a the same host insane.

So we should also consider about bakcporting this patch into linux
-stable.

Link: http://lkml.kernel.org/r/HK2PR06MB045211675B43EED794E597B6D56E0@HK2PR06MB0452.apcprd06.prod.outlook.com
Signed-off-by: Changwei Ge <ge.changwei@h3c.com>
Suggested-by: Changkuo Shi <shi.changkuo@h3c.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <jiangqi903@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-11-03 10:09:37 -07:00
Changwei Ge
9e98578775 ocfs2: don't use iocb when EIOCBQUEUED returns
When -EIOCBQUEUED returns, it means that aio_complete() will be called
from dio_complete(), which is an asynchronous progress against
write_iter.  Generally, IO is a very slow progress than executing
instruction, but we still can't take the risk to access a freed iocb.

And we do face a BUG crash issue.  Using the crash tool, iocb is
obviously freed already.

  crash> struct -x kiocb ffff881a350f5900
  struct kiocb {
    ki_filp = 0xffff881a350f5a80,
    ki_pos = 0x0,
    ki_complete = 0x0,
    private = 0x0,
    ki_flags = 0x0
  }

And the backtrace shows:
  ocfs2_file_write_iter+0xcaa/0xd00 [ocfs2]
  aio_run_iocb+0x229/0x2f0
  do_io_submit+0x291/0x540
  SyS_io_submit+0x10/0x20
  system_call_fastpath+0x16/0x75

Link: http://lkml.kernel.org/r/1523361653-14439-1-git-send-email-ge.changwei@h3c.com
Signed-off-by: Changwei Ge <ge.changwei@h3c.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <jiangqi903@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-11-03 10:09:37 -07:00
Guozhonghua
21158ca85b ocfs2: without quota support, avoid calling quota recovery
During one dead node's recovery by other node, quota recovery work will
be queued.  We should avoid calling quota when it is not supported, so
check the quota flags.

Link: http://lkml.kernel.org/r/71604351584F6A4EBAE558C676F37CA401071AC9FB@H3CMLB12-EX.srv.huawei-3com.com
Signed-off-by: guozhonghua <guozhonghua@h3c.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <jiangqi903@gmail.com>
Cc: Changwei Ge <ge.changwei@h3c.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-11-03 10:09:37 -07:00
Gang He
a634644751 ocfs2: remove ocfs2_is_o2cb_active()
Remove ocfs2_is_o2cb_active().  We have similar functions to identify
which cluster stack is being used via osb->osb_cluster_stack.

Secondly, the current implementation of ocfs2_is_o2cb_active() is not
totally safe.  Based on the design of stackglue, we need to get
ocfs2_stack_lock before using ocfs2_stack related data structures, and
that active_stack pointer can be NULL in the case of mount failure.

Link: http://lkml.kernel.org/r/1495441079-11708-1-git-send-email-ghe@suse.com
Signed-off-by: Gang He <ghe@suse.com>
Reviewed-by: Joseph Qi <jiangqi903@gmail.com>
Reviewed-by: Eric Ren <zren@suse.com>
Acked-by: Changwei Ge <ge.changwei@h3c.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-11-03 10:09:37 -07:00
Andrea Arcangeli
ac5b2c1891 mm: thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings
THP allocation might be really disruptive when allocated on NUMA system
with the local node full or hard to reclaim.  Stefan has posted an
allocation stall report on 4.12 based SLES kernel which suggests the
same issue:

  kvm: page allocation stalls for 194572ms, order:9, mode:0x4740ca(__GFP_HIGHMEM|__GFP_IO|__GFP_FS|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_HARDWALL|__GFP_THISNODE|__GFP_MOVABLE|__GFP_DIRECT_RECLAIM), nodemask=(null)
  kvm cpuset=/ mems_allowed=0-1
  CPU: 10 PID: 84752 Comm: kvm Tainted: G        W 4.12.0+98-ph <a href="/view.php?id=1" title="[geschlossen] Integration Ramdisk" class="resolved">0000001</a> SLE15 (unreleased)
  Hardware name: Supermicro SYS-1029P-WTRT/X11DDW-NT, BIOS 2.0 12/05/2017
  Call Trace:
   dump_stack+0x5c/0x84
   warn_alloc+0xe0/0x180
   __alloc_pages_slowpath+0x820/0xc90
   __alloc_pages_nodemask+0x1cc/0x210
   alloc_pages_vma+0x1e5/0x280
   do_huge_pmd_wp_page+0x83f/0xf00
   __handle_mm_fault+0x93d/0x1060
   handle_mm_fault+0xc6/0x1b0
   __do_page_fault+0x230/0x430
   do_page_fault+0x2a/0x70
   page_fault+0x7b/0x80
   [...]
  Mem-Info:
  active_anon:126315487 inactive_anon:1612476 isolated_anon:5
   active_file:60183 inactive_file:245285 isolated_file:0
   unevictable:15657 dirty:286 writeback:1 unstable:0
   slab_reclaimable:75543 slab_unreclaimable:2509111
   mapped:81814 shmem:31764 pagetables:370616 bounce:0
   free:32294031 free_pcp:6233 free_cma:0
  Node 0 active_anon:254680388kB inactive_anon:1112760kB active_file:240648kB inactive_file:981168kB unevictable:13368kB isolated(anon):0kB isolated(file):0kB mapped:280240kB dirty:1144kB writeback:0kB shmem:95832kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 81225728kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
  Node 1 active_anon:250583072kB inactive_anon:5337144kB active_file:84kB inactive_file:0kB unevictable:49260kB isolated(anon):20kB isolated(file):0kB mapped:47016kB dirty:0kB writeback:4kB shmem:31224kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 31897600kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no

The defrag mode is "madvise" and from the above report it is clear that
the THP has been allocated for MADV_HUGEPAGA vma.

Andrea has identified that the main source of the problem is
__GFP_THISNODE usage:

: The problem is that direct compaction combined with the NUMA
: __GFP_THISNODE logic in mempolicy.c is telling reclaim to swap very
: hard the local node, instead of failing the allocation if there's no
: THP available in the local node.
:
: Such logic was ok until __GFP_THISNODE was added to the THP allocation
: path even with MPOL_DEFAULT.
:
: The idea behind the __GFP_THISNODE addition, is that it is better to
: provide local memory in PAGE_SIZE units than to use remote NUMA THP
: backed memory. That largely depends on the remote latency though, on
: threadrippers for example the overhead is relatively low in my
: experience.
:
: The combination of __GFP_THISNODE and __GFP_DIRECT_RECLAIM results in
: extremely slow qemu startup with vfio, if the VM is larger than the
: size of one host NUMA node. This is because it will try very hard to
: unsuccessfully swapout get_user_pages pinned pages as result of the
: __GFP_THISNODE being set, instead of falling back to PAGE_SIZE
: allocations and instead of trying to allocate THP on other nodes (it
: would be even worse without vfio type1 GUP pins of course, except it'd
: be swapping heavily instead).

Fix this by removing __GFP_THISNODE for THP requests which are
requesting the direct reclaim.  This effectivelly reverts 5265047ac3
on the grounds that the zone/node reclaim was known to be disruptive due
to premature reclaim when there was memory free.  While it made sense at
the time for HPC workloads without NUMA awareness on rare machines, it
was ultimately harmful in the majority of cases.  The existing behaviour
is similar, if not as widespare as it applies to a corner case but
crucially, it cannot be tuned around like zone_reclaim_mode can.  The
default behaviour should always be to cause the least harm for the
common case.

If there are specialised use cases out there that want zone_reclaim_mode
in specific cases, then it can be built on top.  Longterm we should
consider a memory policy which allows for the node reclaim like behavior
for the specific memory ranges which would allow a

[1] http://lkml.kernel.org/r/20180820032204.9591-1-aarcange@redhat.com

Mel said:

: Both patches look correct to me but I'm responding to this one because
: it's the fix.  The change makes sense and moves further away from the
: severe stalling behaviour we used to see with both THP and zone reclaim
: mode.
:
: I put together a basic experiment with usemem configured to reference a
: buffer multiple times that is 80% the size of main memory on a 2-socket
: box with symmetric node sizes and defrag set to "always".  The defrag
: setting is not the default but it would be functionally similar to
: accessing a buffer with madvise(MADV_HUGEPAGE).  Usemem is configured to
: reference the buffer multiple times and while it's not an interesting
: workload, it would be expected to complete reasonably quickly as it fits
: within memory.  The results were;
:
: usemem
:                                   vanilla           noreclaim-v1
: Amean     Elapsd-1       42.78 (   0.00%)       26.87 (  37.18%)
: Amean     Elapsd-3       27.55 (   0.00%)        7.44 (  73.00%)
: Amean     Elapsd-4        5.72 (   0.00%)        5.69 (   0.45%)
:
: This shows the elapsed time in seconds for 1 thread, 3 threads and 4
: threads referencing buffers 80% the size of memory.  With the patches
: applied, it's 37.18% faster for the single thread and 73% faster with two
: threads.  Note that 4 threads showing little difference does not indicate
: the problem is related to thread counts.  It's simply the case that 4
: threads gets spread so their workload mostly fits in one node.
:
: The overall view from /proc/vmstats is more startling
:
:                          4.19.0-rc1  4.19.0-rc1
:                             vanillanoreclaim-v1r1
: Minor Faults               35593425      708164
: Major Faults                 484088          36
: Swap Ins                    3772837           0
: Swap Outs                   3932295           0
:
: Massive amounts of swap in/out without the patch
:
: Direct pages scanned        6013214           0
: Kswapd pages scanned              0           0
: Kswapd pages reclaimed            0           0
: Direct pages reclaimed      4033009           0
:
: Lots of reclaim activity without the patch
:
: Kswapd efficiency              100%        100%
: Kswapd velocity               0.000       0.000
: Direct efficiency               67%        100%
: Direct velocity           11191.956       0.000
:
: Mostly from direct reclaim context as you'd expect without the patch.
:
: Page writes by reclaim  3932314.000       0.000
: Page writes file                 19           0
: Page writes anon            3932295           0
: Page reclaim immediate        42336           0
:
: Writes from reclaim context is never good but the patch eliminates it.
:
: We should never have default behaviour to thrash the system for such a
: basic workload.  If zone reclaim mode behaviour is ever desired but on a
: single task instead of a global basis then the sensible option is to build
: a mempolicy that enforces that behaviour.

This was a severe regression compared to previous kernels that made
important workloads unusable and it starts when __GFP_THISNODE was
added to THP allocations under MADV_HUGEPAGE.  It is not a significant
risk to go to the previous behavior before __GFP_THISNODE was added, it
worked like that for years.

This was simply an optimization to some lucky workloads that can fit in
a single node, but it ended up breaking the VM for others that can't
possibly fit in a single node, so going back is safe.

[mhocko@suse.com: rewrote the changelog based on the one from Andrea]
Link: http://lkml.kernel.org/r/20180925120326.24392-2-mhocko@kernel.org
Fixes: 5265047ac3 ("mm, thp: really limit transparent hugepage allocation to local node")
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reported-by: Stefan Priebe <s.priebe@profihost.ag>
Debugged-by: Andrea Arcangeli <aarcange@redhat.com>
Reported-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: Mel Gorman <mgorman@techsingularity.net>
Tested-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Zi Yan <zi.yan@cs.rutgers.edu>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: <stable@vger.kernel.org>	[4.1+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-11-03 10:09:37 -07:00