find_next_iomem_res() shows up to be a source for overhead in dax
benchmarks.
Improve performance by not considering children of the tree if the top
level does not match. Since the range of the parents should include the
range of the children such check is redundant.
Running sysbench on dax (pmem emulation, with write_cache disabled):
sysbench fileio --file-total-size=3G --file-test-mode=rndwr \
--file-io-mode=mmap --threads=4 --file-fsync-mode=fdatasync run
Provides the following results:
events (avg/stddev)
-------------------
5.2-rc3: 1247669.0000/16075.39
w/patch: 1286320.5000/16402.72 (+3%)
Link: http://lkml.kernel.org/r/20190613045903.4922-3-namit@vmware.com
Signed-off-by: Nadav Amit <namit@vmware.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Toshi Kani <toshi.kani@hpe.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 7635d9cbe8 ("mm, thp, proc: report THP eligibility for each
vma") introduced THPeligible bit for processes' smaps. But, when
checking the eligibility for shmem vma, __transparent_hugepage_enabled()
is called to override the result from shmem_huge_enabled(). It may
result in the anonymous vma's THP flag override shmem's. For example,
running a simple test which create THP for shmem, but with anonymous THP
disabled, when reading the process's smaps, it may show:
7fc92ec00000-7fc92f000000 rw-s 00000000 00:14 27764 /dev/shm/test
Size: 4096 kB
...
[snip]
...
ShmemPmdMapped: 4096 kB
...
[snip]
...
THPeligible: 0
And, /proc/meminfo does show THP allocated and PMD mapped too:
ShmemHugePages: 4096 kB
ShmemPmdMapped: 4096 kB
This doesn't make too much sense. The shmem objects should be treated
separately from anonymous THP. Calling shmem_huge_enabled() with
checking MMF_DISABLE_THP sounds good enough. And, we could skip stack
and dax vma check since we already checked if the vma is shmem already.
Also check if vma is suitable for THP by calling
transhuge_vma_suitable().
And minor fix to smaps output format and documentation.
Link: http://lkml.kernel.org/r/1560401041-32207-3-git-send-email-yang.shi@linux.alibaba.com
Fixes: 7635d9cbe8 ("mm, thp, proc: report THP eligibility for each vma")
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Unlike legacy chips, 57500 chips don't need additional VNIC resources
for aRFS/ntuple. Fix the code accordingly so that we don't reserve
and allocate additional VNICs on 57500 chips. Without this patch,
the driver is failing to initialize when it tries to allocate extra
VNICs.
Fixes: ac33906c67 ("bnxt_en: Add support for aRFS on 57500 chips.")
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Variables allocated by kvzalloc should not be freed by kfree.
Because they may be allocated by vmalloc.
So we replace kfree with kvfree here.
Signed-off-by: Chuhong Yuan <hslester96@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
For good reason, the standard device_lock() is marked
lockdep_set_novalidate_class() because there is simply no sane way to
describe the myriad ways the device_lock() ordered with other locks.
However, that leaves subsystems that know their own local device_lock()
ordering rules to find lock ordering mistakes manually. Instead,
introduce an optional / additional lockdep-enabled lock that a subsystem
can acquire in all the same paths that the device_lock() is acquired.
A conversion of the NFIT driver and NVDIMM subsystem to a
lockdep-validate device_lock() scheme is included. The
debug_nvdimm_lock() implementation implements the correct lock-class and
stacking order for the libnvdimm device topology hierarchy.
Yes, this is a hack, but hopefully it is a useful hack for other
subsystems device_lock() debug sessions. Quoting Greg:
"Yeah, it feels a bit hacky but it's really up to a subsystem to mess up
using it as much as anything else, so user beware :)
I don't object to it if it makes things easier for you to debug."
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Link: https://lore.kernel.org/r/156341210661.292348.7014034644265455704.stgit@dwillia2-desk3.amr.corp.intel.com
A multithreaded namespace creation/destruction stress test currently
deadlocks with the following lockup signature:
INFO: task ndctl:2924 blocked for more than 122 seconds.
Tainted: G OE 5.2.0-rc4+ #3382
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
ndctl D 0 2924 1176 0x00000000
Call Trace:
? __schedule+0x27e/0x780
schedule+0x30/0xb0
wait_nvdimm_bus_probe_idle+0x8a/0xd0 [libnvdimm]
? finish_wait+0x80/0x80
uuid_store+0xe6/0x2e0 [libnvdimm]
kernfs_fop_write+0xf0/0x1a0
vfs_write+0xb7/0x1b0
ksys_write+0x5c/0xd0
do_syscall_64+0x60/0x240
INFO: task ndctl:2923 blocked for more than 122 seconds.
Tainted: G OE 5.2.0-rc4+ #3382
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
ndctl D 0 2923 1175 0x00000000
Call Trace:
? __schedule+0x27e/0x780
? __mutex_lock+0x489/0x910
schedule+0x30/0xb0
schedule_preempt_disabled+0x11/0x20
__mutex_lock+0x48e/0x910
? nvdimm_namespace_common_probe+0x95/0x4d0 [libnvdimm]
? __lock_acquire+0x23f/0x1710
? nvdimm_namespace_common_probe+0x95/0x4d0 [libnvdimm]
nvdimm_namespace_common_probe+0x95/0x4d0 [libnvdimm]
__dax_pmem_probe+0x5e/0x210 [dax_pmem_core]
? nvdimm_bus_probe+0x1d0/0x2c0 [libnvdimm]
dax_pmem_probe+0xc/0x20 [dax_pmem]
nvdimm_bus_probe+0x90/0x2c0 [libnvdimm]
really_probe+0xef/0x390
driver_probe_device+0xb4/0x100
In this sequence an 'nd_dax' device is being probed and trying to take
the lock on its backing namespace to validate that the 'nd_dax' device
indeed has exclusive access to the backing namespace. Meanwhile, another
thread is trying to update the uuid property of that same backing
namespace. So one thread is in the probe path trying to acquire the
lock, and the other thread has acquired the lock and tries to flush the
probe path.
Fix this deadlock by not holding the namespace device_lock over the
wait_nvdimm_bus_probe_idle() synchronization step. In turn this requires
the device_lock to be held on entry to wait_nvdimm_bus_probe_idle() and
subsequently dropped internally to wait_nvdimm_bus_probe_idle().
Cc: <stable@vger.kernel.org>
Fixes: bf9bccc14c ("libnvdimm: pmem label sets and namespace instantiation")
Cc: Vishal Verma <vishal.l.verma@intel.com>
Tested-by: Jane Chu <jane.chu@oracle.com>
Link: https://lore.kernel.org/r/156341210094.292348.2384694131126767789.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Namespace activation expects to be able to reference region badblocks.
The following warning sometimes triggers when asynchronous namespace
activation races in front of the completion of namespace probing. Move
all possible namespace probing after region badblocks initialization.
Otherwise, lockdep sometimes catches the uninitialized state of the
badblocks seqlock with stack trace signatures like:
INFO: trying to register non-static key.
pmem2: detected capacity change from 0 to 136365211648
the code is fine but needs lockdep annotation.
turning off the locking correctness validator.
CPU: 9 PID: 358 Comm: kworker/u80:5 Tainted: G OE 5.2.0-rc4+ #3382
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.0.0 02/06/2015
Workqueue: events_unbound async_run_entry_fn
Call Trace:
dump_stack+0x85/0xc0
pmem1.12: detected capacity change from 0 to 8589934592
register_lock_class+0x56a/0x570
? check_object+0x140/0x270
__lock_acquire+0x80/0x1710
? __mutex_lock+0x39d/0x910
lock_acquire+0x9e/0x180
? nd_pfn_validate+0x28f/0x440 [libnvdimm]
badblocks_check+0x93/0x1f0
? nd_pfn_validate+0x28f/0x440 [libnvdimm]
nd_pfn_validate+0x28f/0x440 [libnvdimm]
? lockdep_hardirqs_on+0xf0/0x180
nd_dax_probe+0x9a/0x120 [libnvdimm]
nd_pmem_probe+0x6d/0x180 [nd_pmem]
nvdimm_bus_probe+0x90/0x2c0 [libnvdimm]
Fixes: 48af2f7e52 ("libnvdimm, pfn: during init, clear errors...")
Cc: <stable@vger.kernel.org>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Reviewed-by: Vishal Verma <vishal.l.verma@intel.com>
Link: https://lore.kernel.org/r/156341208365.292348.1547528796026249120.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
A multithreaded namespace creation/destruction stress test currently
fails with signatures like the following:
sysfs group 'power' not found for kobject 'dax1.1'
RIP: 0010:sysfs_remove_group+0x76/0x80
Call Trace:
device_del+0x73/0x370
device_unregister+0x16/0x50
nd_async_device_unregister+0x1e/0x30 [libnvdimm]
async_run_entry_fn+0x39/0x160
process_one_work+0x23c/0x5e0
worker_thread+0x3c/0x390
BUG: kernel NULL pointer dereference, address: 0000000000000020
RIP: 0010:klist_put+0x1b/0x6c
Call Trace:
klist_del+0xe/0x10
device_del+0x8a/0x2c9
? __switch_to_asm+0x34/0x70
? __switch_to_asm+0x40/0x70
device_unregister+0x44/0x4f
nd_async_device_unregister+0x22/0x2d [libnvdimm]
async_run_entry_fn+0x47/0x15a
process_one_work+0x1a2/0x2eb
worker_thread+0x1b8/0x26e
Use the kill_device() helper to atomically resolve the race of multiple
threads issuing kill, device_unregister(), requests.
Reported-by: Jane Chu <jane.chu@oracle.com>
Reported-by: Erwin Tsaur <erwin.tsaur@oracle.com>
Fixes: 4d88a97aa9 ("libnvdimm, nvdimm: dimm driver and base libnvdimm device-driver...")
Cc: <stable@vger.kernel.org>
Link: https://github.com/pmem/ndctl/issues/96
Tested-by: Tested-by: Jane Chu <jane.chu@oracle.com>
Link: https://lore.kernel.org/r/156341207846.292348.10435719262819764054.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
The libnvdimm subsystem arranges for devices to be destroyed as a result
of a sysfs operation. Since device_unregister() cannot be called from
an actively running sysfs attribute of the same device libnvdimm
arranges for device_unregister() to be performed in an out-of-line async
context.
The driver core maintains a 'dead' state for coordinating its own racing
async registration / de-registration requests. Rather than add local
'dead' state tracking infrastructure to libnvdimm device objects, export
the existing state tracking via a new kill_device() helper.
The kill_device() helper simply marks the device as dead, i.e. that it
is on its way to device_del(), or returns that the device was already
dead. This can be used in advance of calling device_unregister() for
subsystems like libnvdimm that might need to handle multiple user
threads racing to delete a device.
This refactoring does not change any behavior, but it is a pre-requisite
for follow-on fixes and therefore marked for -stable.
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Fixes: 4d88a97aa9 ("libnvdimm, nvdimm: dimm driver and base libnvdimm device-driver...")
Cc: <stable@vger.kernel.org>
Tested-by: Jane Chu <jane.chu@oracle.com>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: https://lore.kernel.org/r/156341207332.292348.14959761496009347574.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Servers can defer destaging any data and updating the mtime until close().
This means that if we do a setinfo to modify the mtime while other handles
are open for write the server may overwrite our setinfo timestamps when
if flushes the file on close() of the writeable handle.
To solve this we add an explicit flush when the mtime is about to
be updated.
This fixes "cp -p" to preserve mtime when copying a file onto an SMB2 share.
CC: Stable <stable@vger.kernel.org>
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
We can cut one third of the traffic on open by not querying the
inode number explicitly via SMB3 query_info since it is now
returned on open in the qfid context.
This is better in multiple ways, and
speeds up file open about 10% (more if network is slow).
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Pull more device mapper updates from Mike Snitzer:
- Fix zone state management race in DM zoned target by eliminating the
unnecessary DMZ_ACTIVE state.
- A couple fixes for issues the DM snapshot target's optional discard
support added during first week of the 5.3 merge.
- Increase default size of outstanding IO that is allowed for a each
dm-kcopyd client and introduce tunable to allow user adjust.
- Update DM core to use printk ratelimiting functions rather than
duplicate them and in doing so fix an issue where DMDEBUG_LIMIT()
rate limited KERN_DEBUG messages had excessive "callbacks suppressed"
messages.
* tag 'for-5.3/dm-changes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
dm: use printk ratelimiting functions
dm kcopyd: Increase default sub-job size to 512KB
dm snapshot: fix oversights in optional discard support
dm zoned: fix zone state management race
Pull NFS client updates from Trond Myklebust:
"Highlights include:
Stable fixes:
- SUNRPC: Ensure bvecs are re-synced when we re-encode the RPC
request
- Fix an Oops in ff_layout_track_ds_error due to a PTR_ERR()
dereference
- Revert buggy NFS readdirplus optimisation
- NFSv4: Handle the special Linux file open access mode
- pnfs: Fix a problem where we gratuitously start doing I/O through
the MDS
Features:
- Allow NFS client to set up multiple TCP connections to the server
using a new 'nconnect=X' mount option. Queue length is used to
balance load.
- Enhance statistics reporting to report on all transports when using
multiple connections.
- Speed up SUNRPC by removing bh-safe spinlocks
- Add a mechanism to allow NFSv4 to request that containers set a
unique per-host identifier for when the hostname is not set.
- Ensure NFSv4 updates the lease_time after a clientid update
Bugfixes and cleanup:
- Fix use-after-free in rpcrdma_post_recvs
- Fix a memory leak when nfs_match_client() is interrupted
- Fix buggy file access checking in NFSv4 open for execute
- disable unsupported client side deduplication
- Fix spurious client disconnections
- Fix occasional RDMA transport deadlock
- Various RDMA cleanups
- Various tracepoint fixes
- Fix the TCP callback channel to guarantee the server can actually
send the number of callback requests that was negotiated at mount
time"
* tag 'nfs-for-5.3-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (68 commits)
pnfs/flexfiles: Add tracepoints for detecting pnfs fallback to MDS
pnfs: Fix a problem where we gratuitously start doing I/O through the MDS
SUNRPC: Optimise transport balancing code
SUNRPC: Ensure the bvecs are reset when we re-encode the RPC request
pnfs/flexfiles: Fix PTR_ERR() dereferences in ff_layout_track_ds_error
NFSv4: Don't use the zero stateid with layoutget
SUNRPC: Fix up backchannel slot table accounting
SUNRPC: Fix initialisation of struct rpc_xprt_switch
SUNRPC: Skip zero-refcount transports
SUNRPC: Replace division by multiplication in calculation of queue length
NFSv4: Validate the stateid before applying it to state recovery
nfs4.0: Refetch lease_time after clientid update
nfs4: Rename nfs41_setup_state_renewal
nfs4: Make nfs4_proc_get_lease_time available for nfs4.0
nfs: Fix copy-and-paste error in debug message
NFS: Replace 16 seq_printf() calls by seq_puts()
NFS: Use seq_putc() in nfs_show_stats()
Revert "NFS: readdirplus optimization by cache mechanism" (memleak)
SUNRPC: Fix transport accounting when caller specifies an rpc_xprt
NFS: Record task, client ID, and XID in xdr_status trace points
...
Alexei Starovoitov says:
====================
pull-request: bpf 2019-07-18
The following pull-request contains BPF updates for your *net* tree.
The main changes are:
1) verifier precision propagation fix, from Andrii.
2) BTF size fix for typedefs, from Andrii.
3) a bunch of big endian fixes, from Ilya.
4) wide load from bpf_sock_addr fixes, from Stanislav.
5) a bunch of misc fixes from a number of developers.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
test_xdp_noinline fails on s390 due to a handful of endianness issues.
Use ntohs for parsing eth_proto.
Replace bswaps with ntohs/htons.
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Acked-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
This test looks up a 32-bit map element and then loads it using a 64-bit
load. This does not work on s390, which is a big-endian machine.
Since the point of this test doesn't seem to be loading a smaller value
using a larger load, simply use a 32-bit load.
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Add tracepoints to allow debugging of the event chain leading to
a pnfs fallback to doing I/O through the MDS.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Kernel build warns:
'sanitize_boot_params' defined but not used [-Wunused-function]
at below files:
arch/x86/boot/compressed/cmdline.c
arch/x86/boot/compressed/error.c
arch/x86/boot/compressed/early_serial_console.c
arch/x86/boot/compressed/acpi.c
That's becausethey each include misc.h which includes a definition of
sanitize_boot_params() via bootparam_utils.h.
Remove the inclusion from misc.h and have the c file including
bootparam_utils.h directly.
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/1563283092-1189-1-git-send-email-zhenzhong.duan@oracle.com
If the client has to stop in pnfs_update_layout() to wait for another
layoutget to complete, it currently exits and defaults to I/O through
the MDS if the layoutget was successful.
Fixes: d03360aaf5 ("pNFS: Ensure we return the error if someone kills...")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: stable@vger.kernel.org # v4.20+
[Why]:
Some active dongles have DP++ port and DP port at the same time. Current
code doesn't cover DP++ case and processes as default DVI case, in which
audio is disabled. Because of dual mode, DP case is also treat as DVI case
for the other port.
[How]:
According DP 1.4 spec, add DP++ procedure similar with HDMI case. Also
add None dongle type for DP case.
Signed-off-by: Dale Zhao <dale.zhao@amd.com>
Reviewed-by: Wenjing Liu <wenjing.liu@amd.com>
Acked-by: Leo Li <sunpeng.li@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
[Why]
DC configures the GSL group for the pipe when pipe_split is enabled
and we're switching flip types (buffered <-> immediate flip) on DCN2.
In order to record what GSL group the pipe is using DC stores it in
the pipe's stream_res. DM is not aware of this internal grouping, nor
is DC resource.
So when DM creates a dc_state context and passes it to DC the current
GSL group is lost - DM never knew about it in the first place.
After 3 immediate flips we run out of GSL groups and we're no longer
able to correctly perform *any* flip for multi-pipe scenarios.
[How]
The gsl_group needs to be copied to the new context.
DM has no insight into GSL grouping and could even potentially create
a brand new context without referencing current hardware state. So this
makes the most sense to have happen in DC.
There are two places where DC can apply a new context:
- dc_commit_state
- dc_commit_updates_for_stream
But what's shared between both of these is apply_ctx_for_surface.
This logic only matters for DCN2, so it can be placed in
dcn20_apply_ctx_for_surface. Before doing any locking (where the GSL
group is setup) we can copy over the GSL groups before committing the
new context.
Signed-off-by: Nicholas Kazlauskas <nicholas.kazlauskas@amd.com>
Reviewed-by: Hersen Wu <hersen.wu@amd.com>
Acked-by: Leo Li <sunpeng.li@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>