Commit Graph

109009 Commits

Author SHA1 Message Date
Daniel Drake
684bec1092 Input: i8042 - enable keyboard wakeups by default when s2idle is used
Previously, on typical consumer laptops, pressing a key on the keyboard
when the system is in suspend would cause it to wake up (default or
unconditional behaviour). This happens because the EC generates a SCI
interrupt in this scenario.

That is no longer true on modern laptops based on Intel WhiskeyLake,
including Acer Swift SF314-55G, Asus UX333FA, Asus UX433FN and Asus
UX533FD. We confirmed with Asus EC engineers that the "Modern Standby"
design has been modified so that the EC no longer generates a SCI
in this case; the keyboard controller itself should be used for wakeup.

In order to retain the standard behaviour of being able to use the
keyboard to wake up the system, enable serio wakeups by default on
platforms that are using s2idle.

Link: https://lkml.kernel.org/r/CAB4CAwfQ0mPMqCLp95TVjw4J0r5zKPWkSvvkK4cpZUGE--w8bQ@mail.gmail.com
Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Daniel Drake <drake@endlessm.com>
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
2018-10-01 15:58:53 -07:00
Eric Dumazet
1ad98e9d1b tcp/dccp: fix lockdep issue when SYN is backlogged
In normal SYN processing, packets are handled without listener
lock and in RCU protected ingress path.

But syzkaller is known to be able to trick us and SYN
packets might be processed in process context, after being
queued into socket backlog.

In commit 06f877d613 ("tcp/dccp: fix other lockdep splats
accessing ireq_opt") I made a very stupid fix, that happened
to work mostly because of the regular path being RCU protected.

Really the thing protecting ireq->ireq_opt is RCU read lock,
and the pseudo request refcnt is not relevant.

This patch extends what I did in commit 449809a66c ("tcp/dccp:
block BH for SYN processing") by adding an extra rcu_read_{lock|unlock}
pair in the paths that might be taken when processing SYN from
socket backlog (thus possibly in process context)

Fixes: 06f877d613 ("tcp/dccp: fix other lockdep splats accessing ireq_opt")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-01 15:42:13 -07:00
David S. Miller
c8424ddd97 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf
Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following patchset contains Netfilter fixes for your net tree:

1) Skip ip_sabotage_in() for packet making into the VRF driver,
   otherwise packets are dropped, from David Ahern.

2) Clang compilation warning uncovering typo in the
   nft_validate_register_store() call from nft_osf, from Stefan Agner.

3) Double sizeof netlink message length calculations in ctnetlink,
   from zhong jiang.

4) Missing rb_erase() on batch full in rbtree garbage collector,
   from Taehee Yoo.

5) Calm down compilation warning in nf_hook(), from Florian Westphal.

6) Missing check for non-null sk in xt_socket before validating
   netns procedence, from Flavio Leitner.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-01 15:41:01 -07:00
James Smart
783f4a4408 nvme: call nvme_complete_rq when nvmf_check_ready fails for mpath I/O
When an io is rejected by nvmf_check_ready() due to validation of the
controller state, the nvmf_fail_nonready_command() will normally return
BLK_STS_RESOURCE to requeue and retry.  However, if the controller is
dying or the I/O is marked for NVMe multipath, the I/O is failed so that
the controller can terminate or so that the io can be issued on a
different path.  Unfortunately, as this reject point is before the
transport has accepted the command, blk-mq ends up completing the I/O
and never calls nvme_complete_rq(), which is where multipath may preserve
or re-route the I/O. The end result is, the device user ends up seeing an
EIO error.

Example: single path connectivity, controller is under load, and a reset
is induced.  An I/O is received:

  a) while the reset state has been set but the queues have yet to be
     stopped; or
  b) after queues are started (at end of reset) but before the reconnect
     has completed.

The I/O finishes with an EIO status.

This patch makes the following changes:

  - Adds the HOST_PATH_ERROR pathing status from TP4028
  - Modifies the reject point such that it appears to queue successfully,
    but actually completes the io with the new pathing status and calls
    nvme_complete_rq().
  - nvme_complete_rq() recognizes the new status, avoids resetting the
    controller (likely was already done in order to get this new status),
    and calls the multipather to clear the current path that errored.
    This allows the next command (retry or new command) to select a new
    path if there is one.

Signed-off-by: James Smart <jsmart2021@gmail.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-10-01 14:16:13 -07:00
Ville Syrjälä
468d6a4996 video/hdmi: Constify infoframe passed to the log functions
The log functions don't modify the passed in infoframe so make it const.

Cc: Thierry Reding <thierry.reding@gmail.com>
Cc: Hans Verkuil <hans.verkuil@cisco.com>
Cc: linux-media@vger.kernel.org
Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20180920185145.1912-4-ville.syrjala@linux.intel.com
Acked-by: Hans Verkuil <hans.verkuil@cisco.com>
2018-10-01 22:00:28 +03:00
Ville Syrjälä
480b8b3e42 video/hdmi: Pass buffer size to infoframe unpack functions
To make sure the infoframe unpack functions don't end up examining
stack garbage or oopsing, let's pass in the size of the buffer.

v2: Convert tda1997x.c as well (kbuild test robot)

Cc: Thierry Reding <thierry.reding@gmail.com>
Cc: Hans Verkuil <hans.verkuil@cisco.com>
Cc: linux-media@vger.kernel.org
Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20180920185145.1912-3-ville.syrjala@linux.intel.com
Acked-by: Hans Verkuil <hans.verkuil@cisco.com>
2018-10-01 21:56:14 +03:00
Ville Syrjälä
f26e1de5ec video/hdmi: Constify 'buffer' to the unpack functions
The unpack functions just read from the passed in buffer,
so make it const.

Cc: Thierry Reding <thierry.reding@gmail.com>
Cc: Hans Verkuil <hans.verkuil@cisco.com>
Cc: linux-media@vger.kernel.org
Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20180920185145.1912-2-ville.syrjala@linux.intel.com
Acked-by: Hans Verkuil <hans.verkuil@cisco.com>
2018-10-01 21:35:56 +03:00
Alaa Hleihel
59c9d35ea9 net/mlx5: Cache the system image guid
The system image guid is a read-only field which is used by the TC
offloads code to determine if two mlx5 devices belong to the same
ASIC while adding flows.

Read this once and save it on the core device rather than querying each
time an offloaded flow is added.

Signed-off-by: Alaa Hleihel <alaa@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-10-01 11:32:47 -07:00
Eric Whitney
9fe671496b ext4: adjust reserved cluster count when removing extents
Modify ext4_ext_remove_space() and the code it calls to correct the
reserved cluster count for pending reservations (delayed allocated
clusters shared with allocated blocks) when a block range is removed
from the extent tree.  Pending reservations may be found for the clusters
at the ends of written or unwritten extents when a block range is removed.
If a physical cluster at the end of an extent is freed, it's necessary
to increment the reserved cluster count to maintain correct accounting
if the corresponding logical cluster is shared with at least one
delayed and unwritten extent as found in the extents status tree.

Add a new function, ext4_rereserve_cluster(), to reapply a reservation
on a delayed allocated cluster sharing blocks with a freed allocated
cluster.  To avoid ENOSPC on reservation, a flag is applied to
ext4_free_blocks() to briefly defer updating the freeclusters counter
when an allocated cluster is freed.  This prevents another thread
from allocating the freed block before the reservation can be reapplied.

Redefine the partial cluster object as a struct to carry more state
information and to clarify the code using it.

Adjust the conditional code structure in ext4_ext_remove_space to
reduce the indentation level in the main body of the code to improve
readability.

Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2018-10-01 14:25:08 -04:00
Eric Whitney
0b02f4c0d6 ext4: fix reserved cluster accounting at delayed write time
The code in ext4_da_map_blocks sometimes reserves space for more
delayed allocated clusters than it should, resulting in premature
ENOSPC, exceeded quota, and inaccurate free space reporting.

Fix this by checking for written and unwritten blocks shared in the
same cluster with the newly delayed allocated block.  A cluster
reservation should not be made for a cluster for which physical space
has already been allocated.

Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2018-10-01 14:19:37 -04:00
Eric Whitney
ad431025ae ext4: generalize extents status tree search functions
Ext4 contains a few functions that are used to search for delayed
extents or blocks in the extents status tree.  Rather than duplicate
code to add new functions to search for extents with different status
values, such as written or a combination of delayed and unwritten,
generalize the existing code to search for caller-specified extents
status values.  Also, move this code into extents_status.c where it
is better associated with the data structures it operates upon, and
where it can be more readily used to implement new extents status tree
functions that might want a broader scope for i_es_lock.

Three missing static specifiers in RFC version of patch reported and
fixed by Fengguang Wu <fengguang.wu@intel.com>.

Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2018-10-01 14:10:39 -04:00
Alaa Hleihel
4d8fcf216c net/mlx5e: Avoid unbounded peer devices when unpairing TC hairpin rules
If the peer device was already unbound, then do not attempt to modify
it's resources, otherwise we will crash on dereferencing non-existing
device.

Fixes: 5c65c564c9 ("net/mlx5e: Support offloading TC NIC hairpin flows")
Signed-off-by: Alaa Hleihel <alaa@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-10-01 10:58:00 -07:00
Joerg Roedel
6f20a97e09 Merge branch 'for-joerg/arm-smmu/updates' of git://git.kernel.org/pub/scm/linux/kernel/git/will/linux into arm/smmu 2018-10-01 17:28:13 +02:00
Jens Axboe
c0aac682fa Merge tag 'v4.19-rc6' into for-4.20/block
Merge -rc6 in, for two reasons:

1) Resolve a trivial conflict in the blk-mq-tag.c documentation
2) A few important regression fixes went into upstream directly, so
   they aren't in the 4.20 branch.

Signed-off-by: Jens Axboe <axboe@kernel.dk>

* tag 'v4.19-rc6': (780 commits)
  Linux 4.19-rc6
  MAINTAINERS: fix reference to moved drivers/{misc => auxdisplay}/panel.c
  cpufreq: qcom-kryo: Fix section annotations
  perf/core: Add sanity check to deal with pinned event failure
  xen/blkfront: correct purging of persistent grants
  Revert "xen/blkfront: When purging persistent grants, keep them in the buffer"
  selftests/powerpc: Fix Makefiles for headers_install change
  blk-mq: I/O and timer unplugs are inverted in blktrace
  dax: Fix deadlock in dax_lock_mapping_entry()
  x86/boot: Fix kexec booting failure in the SEV bit detection code
  bcache: add separate workqueue for journal_write to avoid deadlock
  drm/amd/display: Fix Edid emulation for linux
  drm/amd/display: Fix Vega10 lightup on S3 resume
  drm/amdgpu: Fix vce work queue was not cancelled when suspend
  Revert "drm/panel: Add device_link from panel device to DRM device"
  xen/blkfront: When purging persistent grants, keep them in the buffer
  clocksource/drivers/timer-atmel-pit: Properly handle error cases
  block: fix deadline elevator drain for zoned block devices
  ACPI / hotplug / PCI: Don't scan for non-hotplug bridges if slot is not bridge
  drm/syncobj: Don't leak fences when WAIT_FOR_SUBMIT is set
  ...

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-10-01 08:58:57 -06:00
Christoph Hellwig
b4ebe60632 dma-direct: implement complete bus_dma_mask handling
Instead of rejecting devices with a too small bus_dma_mask we can handle
by taking the bus dma_mask into account for allocations and bounce
buffering decisions.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-10-01 07:28:03 -07:00
Christoph Hellwig
a20bb05837 dma-direct: add an explicit dma_direct_get_required_mask
This is somewhat modelled after the powerpc version, and differs from
the legacy fallback in use fls64 instead of pointlessly splitting up the
address into low and high dwords and in that it takes (__)phys_to_dma
into account.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Reviewed-by: Robin Murphy <robin.murphy@arm.com>
2018-10-01 07:27:15 -07:00
Christoph Hellwig
c6d4381220 dma-mapping: make the get_required_mask method available unconditionally
This save some duplication for ia64, and makes the interface more
general.  In the long run we want each dma_map_ops instance to fill this
out, but this will take a little more prep work.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2018-10-01 07:27:00 -07:00
Roman Gushchin
b741f16303 bpf: introduce per-cpu cgroup local storage
This commit introduced per-cpu cgroup local storage.

Per-cpu cgroup local storage is very similar to simple cgroup storage
(let's call it shared), except all the data is per-cpu.

The main goal of per-cpu variant is to implement super fast
counters (e.g. packet counters), which don't require neither
lookups, neither atomic operations.

>From userspace's point of view, accessing a per-cpu cgroup storage
is similar to other per-cpu map types (e.g. per-cpu hashmaps and
arrays).

Writing to a per-cpu cgroup storage is not atomic, but is performed
by copying longs, so some minimal atomicity is here, exactly
as with other per-cpu maps.

Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Alexei Starovoitov <ast@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-10-01 16:18:32 +02:00
Roman Gushchin
f294b37ec7 bpf: rework cgroup storage pointer passing
To simplify the following introduction of per-cpu cgroup storage,
let's rework a bit a mechanism of passing a pointer to a cgroup
storage into the bpf_get_local_storage(). Let's save a pointer
to the corresponding bpf_cgroup_storage structure, instead of
a pointer to the actual buffer.

It will help us to handle per-cpu storage later, which has
a different way of accessing to the actual data.

Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Song Liu <songliubraving@fb.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-10-01 16:18:32 +02:00
Roman Gushchin
8bad74f984 bpf: extend cgroup bpf core to allow multiple cgroup storage types
In order to introduce per-cpu cgroup storage, let's generalize
bpf cgroup core to support multiple cgroup storage types.
Potentially, per-node cgroup storage can be added later.

This commit is mostly a formal change that replaces
cgroup_storage pointer with a array of cgroup_storage pointers.
It doesn't actually introduce a new storage type,
it will be done later.

Each bpf program is now able to have one cgroup storage of each type.

Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Song Liu <songliubraving@fb.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-10-01 16:18:32 +02:00
Kristina Martsenko
8ad50c8985 vgic: Add support for 52bit guest physical address
Add support for handling 52bit guest physical address to the
VGIC layer. So far we have limited the guest physical address
to 48bits, by explicitly masking the upper bits. This patch
removes the restriction. We do not have to check if the host
supports 52bit as the gpa is always validated during an access.
(e.g, kvm_{read/write}_guest, kvm_is_visible_gfn()).
Also, the ITS table save-restore is also not affected with
the enhancement. The DTE entries already store the bits[51:8]
of the ITT_addr (with a 256byte alignment).

Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: Christoffer Dall <cdall@kernel.org>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Kristina Martsenko <kristina.martsenko@arm.com>
[ Macro clean ups, fix PROPBASER and PENDBASER accesses ]
Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
2018-10-01 13:50:32 +01:00
Zhen Lei
2da274cdf9 iommu/dma: Add support for non-strict mode
With the flush queue infrastructure already abstracted into IOVA
domains, hooking it up in iommu-dma is pretty simple. Since there is a
degree of dependency on the IOMMU driver knowing what to do to play
along, we key the whole thing off a domain attribute which will be set
on default DMA ops domains to request non-strict invalidation. That way,
drivers can indicate the appropriate support by acknowledging the
attribute, and we can easily fall back to strict invalidation otherwise.

The flush queue callback needs a handle on the iommu_domain which owns
our cookie, so we have to add a pointer back to that, but neatly, that's
also sufficient to indicate whether we're using a flush queue or not,
and thus which way to release IOVAs. The only slight subtlety is
switching __iommu_dma_unmap() from calling iommu_unmap() to explicit
iommu_unmap_fast()/iommu_tlb_sync() so that we can elide the sync
entirely in non-strict mode.

Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
[rm: convert to domain attribute, tweak comments and commit message]
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
2018-10-01 13:01:32 +01:00
Linus Walleij
30aa69e7bd Merge tag 'v4.19-rc6' into devel
This is the 4.19-rc6 release

I needed to merge this in because of extensive conflicts in
the MSM and Intel pin control drivers. I know how to resolve
them, so let's do it like this.

Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
2018-10-01 13:20:45 +02:00
Will Deacon
22839869f2 signal: Introduce COMPAT_SIGMINSTKSZ for use in compat_sys_sigaltstack
The sigaltstack(2) system call fails with -ENOMEM if the new alternative
signal stack is found to be smaller than SIGMINSTKSZ. On architectures
such as arm64, where the native value for SIGMINSTKSZ is larger than
the compat value, this can result in an unexpected error being reported
to a compat task. See, for example:

  https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=904385

This patch fixes the problem by extending do_sigaltstack to take the
minimum signal stack size as an additional parameter, allowing the
native and compat system call entry code to pass in their respective
values. COMPAT_SIGMINSTKSZ is just defined as SIGMINSTKSZ if it has not
been defined by the architecture.

Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Dominik Brodowski <linux@dominikbrodowski.net>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Oleg Nesterov <oleg@redhat.com>
Reported-by: Steve McIntyre <steve.mcintyre@arm.com>
Tested-by: Steve McIntyre <93sam@debian.org>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2018-10-01 11:43:55 +01:00
Rafael J. Wysocki
fb64207bf5 Merge branch 'opp/linux-next' of git://git.kernel.org/pub/scm/linux/kernel/git/vireshk/pm into pm-opp
Pull operating performance points (OPP) material for 4.20 from Viresh Kumar.

"This contains patches that fix several bugs in the OPP core and
makes it more stable."

* 'opp/linux-next' of git://git.kernel.org/pub/scm/linux/kernel/git/vireshk/pm:
  OPP: Pass OPP table to _of_add_opp_table_v{1|2}()
  OPP: Prevent creating multiple OPP tables for devices sharing OPP nodes
  OPP: Use a single mechanism to free the OPP table
  OPP: Don't remove dynamic OPPs from _dev_pm_opp_remove_table()
  cpufreq: mvebu: Remove OPPs using dev_pm_opp_remove()
  OPP: Create separate kref for static OPPs list
  OPP: Don't take OPP table's kref for static OPPs
  OPP: Parse OPP table's DT properties from _of_init_opp_table()
  OPP: Pass index to _of_init_opp_table()
  OPP: Protect dev_list with opp_table lock
  OPP: Don't try to remove all OPP tables on failure
  OPP: Free OPP table properly on performance state irregularities
2018-10-01 12:32:32 +02:00
Geert Uytterhoeven
3c940660cb gpio: Restore indentation of continued lines
Fixes: 3027743f83 ("gpio: Remove VLA from gpiolib")
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
2018-10-01 12:22:41 +02:00
Geert Uytterhoeven
cf9af0d578 gpio: Propagate errors from gpiod_set_array_value_complex()
Internal helper function gpiod_set_array_value_complex() was changed to
return an error value, but not all gpiolib callers were updated to
propagate the new error up.

Fixes: 3027743f83 ("gpio: Remove VLA from gpiolib")
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
2018-10-01 12:22:15 +02:00
Constantine Shulyupin
5da784cce4 fuse: add max_pages to init_out
Replace FUSE_MAX_PAGES_PER_REQ with the configurable parameter max_pages to
improve performance.

Old RFC with detailed description of the problem and many fixes by Mitsuo
Hayasaka (mitsuo.hayasaka.hu@hitachi.com):
 - https://lkml.org/lkml/2012/7/5/136

We've encountered performance degradation and fixed it on a big and complex
virtual environment.

Environment to reproduce degradation and improvement:

1. Add lag to user mode FUSE
Add nanosleep(&(struct timespec){ 0, 1000 }, NULL); to xmp_write_buf in
passthrough_fh.c

2. patch UM fuse with configurable max_pages parameter. The patch will be
provided latter.

3. run test script and perform test on tmpfs
fuse_test()
{

       cd /tmp
       mkdir -p fusemnt
       passthrough_fh -o max_pages=$1 /tmp/fusemnt
       grep fuse /proc/self/mounts
       dd conv=fdatasync oflag=dsync if=/dev/zero of=fusemnt/tmp/tmp \
		count=1K bs=1M 2>&1 | grep -v records
       rm fusemnt/tmp/tmp
       killall passthrough_fh
}

Test results:

passthrough_fh /tmp/fusemnt fuse.passthrough_fh \
	rw,nosuid,nodev,relatime,user_id=0,group_id=0 0 0
1073741824 bytes (1.1 GB) copied, 1.73867 s, 618 MB/s

passthrough_fh /tmp/fusemnt fuse.passthrough_fh \
	rw,nosuid,nodev,relatime,user_id=0,group_id=0,max_pages=256 0 0
1073741824 bytes (1.1 GB) copied, 1.15643 s, 928 MB/s

Obviously with bigger lag the difference between 'before' and 'after'
will be more significant.

Mitsuo Hayasaka, in 2012 (https://lkml.org/lkml/2012/7/5/136),
observed improvement from 400-550 to 520-740.

Signed-off-by: Constantine Shulyupin <const@MakeLinux.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-10-01 10:07:06 +02:00
Junling Zheng
d440c52d31 f2fs: support superblock checksum
Now we support crc32 checksum for superblock.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Junling Zheng <zhengjunling@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-09-30 18:34:18 -07:00
Dennis Dalessandro
5da0fc9dbf IB/hfi1: Prepare resource waits for dual leg
Current implementation allows each qp to have only one send engine.  As
such, each qp has only one list to queue prebuilt packets when send engine
resources are not available. To improve performance, it is desired to
support multiple send engines for each qp.

This patch creates the framework to support two send engines
(two legs) for each qp for the TID RDMA protocol, which can be easily
extended to support more send engines. It achieves the goal by creating a
leg specific struct, iowait_work in the iowait struct, to hold the
work_struct and the tx_list as well as a pointer to the parent iowait
struct.

The hfi1_pkt_state now has an additional field to record the current legs
work structure and that is now passed to all egress waiters to determine
the leg that needs to wait via a new iowait helper.  The APIs are adjusted
to use the new leg specific struct as required.

Many new and modified helpers are added to support this change.

Reviewed-by: Mitko Haralanov <mitko.haralanov@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-09-30 19:21:12 -06:00
Kaike Wan
d205a06a14 IB/rdmavt: Rename check_send_wqe as setup_wqe
The driver-provided function check_send_wqe allows the hardware driver to
check and set up the incoming send wqe before it is inserted into the swqe
ring. This patch will rename it as setup_wqe to better reflect its
usage. In addition, this function is only called when all setup is
complete in rdmavt.

Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-09-30 19:21:11 -06:00
Nathan Chancellor
aef716fa5e RDMA/qedr: Remove enumerated type qed_roce_ll2_tx_dest
Clang warns when one enumerated type is explicitly converted to another.

drivers/infiniband/hw/qedr/qedr_roce_cm.c:198:28: warning: implicit
conversion from enumeration type 'enum qed_roce_ll2_tx_dest' to
different enumeration type 'enum qed_ll2_tx_dest' [-Wenum-conversion]
        ll2_tx_pkt.tx_dest = pkt->tx_dest;
                           ~ ~~~~~^~~~~~~
1 warning generated.

Turns out that QED_ROCE_LL2_TX_DEST_NW and QED_ROCE_LL2_TX_DEST_LB are
only used once in the whole tree and QED_ROCE_LL2_TX_DEST_MAX is used
nowhere. Remove them and use the equivalent values from qed_ll2_tx_dest
in their place.

Reported-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Acked-by: Michal Kalderon <michal.kalderon@cavium.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-09-30 19:21:11 -06:00
Trond Myklebust
571ed1fd23 SUNRPC: Replace krb5_seq_lock with a lockless scheme
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-09-30 15:35:18 -04:00
Trond Myklebust
1c6c4b740d NFS: Remove private spinlock in struct nfs_pgio_header
Now that each struct nfs_pgio_header corresponds to one RPC call, we
only have one writer to the struct nfs_pgio_header.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-09-30 15:35:17 -04:00
Trond Myklebust
28d52235ee NFSv4: Save a few bytes in the nfs_pgio_args/res
Save a few bytes by allowing the read/write specific fields of the
structures to share storage.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-09-30 15:35:17 -04:00
Trond Myklebust
8d8928d879 NFSv3: Improve NFSv3 performance when server returns no post-op attributes
When the server fails to return post-op attributes, the client's
attempt to place read data directly in the page cache fails, and
so we have to do an extra copy in order to realign the data with
page borders.
This patch attempts to detect servers that don't return post-op
attributes on read (e.g. for pNFS) and adjusts the placement
calculation accordingly.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-09-30 15:35:17 -04:00
Trond Myklebust
0de43976fb NFS: Convert lookups of the open context to RCU
Reduce contention on the inode->i_lock by ensuring that we use RCU
when looking up the NFS open context.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-09-30 15:35:17 -04:00
Trond Myklebust
1db97eaa0b NFS: Convert lookups of the lock context to RCU
Speed up lookups of an existing lock context by avoiding the inode->i_lock,
and using RCU instead.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-09-30 15:35:16 -04:00
Trond Myklebust
ec846469ba SUNRPC: Unexport xdr_partial_copy_from_skb()
It is no longer used outside of net/sunrpc/socklib.c

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-09-30 15:35:16 -04:00
Trond Myklebust
550aebfe1c SUNRPC: Allow AF_LOCAL sockets to use the generic stream receive
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-09-30 15:35:16 -04:00
Trond Myklebust
c50b8ee02f SUNRPC: Clean up - rename xs_tcp_data_receive() to xs_stream_data_receive()
In preparation for sharing with AF_LOCAL.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-09-30 15:35:16 -04:00
Trond Myklebust
277e4ab7d5 SUNRPC: Simplify TCP receive code by switching to using iterators
Most of this code should also be reusable with other socket types.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-09-30 15:35:16 -04:00
Trond Myklebust
9d96acbc7f SUNRPC: Add a bvec array to struct xdr_buf for use with iovec_iter()
Add a bvec array to struct xdr_buf, and have the client allocate it
when we need to receive data into pages.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-09-30 15:35:16 -04:00
Trond Myklebust
431f6eb357 SUNRPC: Add a label for RPC calls that require allocation on receive
If the RPC call relies on the receive call allocating pages as buffers,
then let's label it so that we
a) Don't leak memory by allocating pages for requests that do not expect
   this behaviour
b) Can optimise for the common case where calls do not require allocation.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-09-30 15:35:16 -04:00
Trond Myklebust
f42f7c2830 SUNRPC: Fix priority queue fairness
Fix up the priority queue to not batch by owner, but by queue, so that
we allow '1 << priority' elements to be dequeued before switching to
the next priority queue.
The owner field is still used to wake up requests in round robin order
by owner to avoid single processes hogging the RPC layer by loading the
queues.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-09-30 15:35:16 -04:00
Trond Myklebust
95f7691daa SUNRPC: Convert xprt receive queue to use an rbtree
If the server is slow, we can find ourselves with quite a lot of entries
on the receive queue. Converting the search from an O(n) to O(log(n))
can make a significant difference, particularly since we have to hold
a number of locks while searching.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-09-30 15:35:16 -04:00
Trond Myklebust
adfa71446d SUNRPC: Cleanup: remove the unused 'task' argument from the request_send()
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-09-30 15:35:16 -04:00
Trond Myklebust
c544577dad SUNRPC: Clean up transport write space handling
Treat socket write space handling in the same way we now treat transport
congestion: by denying the XPRT_LOCK until the transport signals that it
has free buffer space.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-09-30 15:35:15 -04:00
Trond Myklebust
36bd7de949 SUNRPC: Turn off throttling of RPC slots for TCP sockets
The theory was that we would need to grab the socket lock anyway, so we
might as well use it to gate the allocation of RPC slots for a TCP
socket.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-09-30 15:35:15 -04:00
Trond Myklebust
75891f502f SUNRPC: Support for congestion control when queuing is enabled
Both RDMA and UDP transports require the request to get a "congestion control"
credit before they can be transmitted. Right now, this is done when
the request locks the socket. We'd like it to happen when a request attempts
to be transmitted for the first time.
In order to support retransmission of requests that already hold such
credits, we also want to ensure that they get queued first, so that we
don't deadlock with requests that have yet to obtain a credit.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-09-30 15:35:15 -04:00