The variable err is initialized with a value that is never read
and err is reassigned a few statements later. This initialization
is redundant and can be removed.
Addresses-Coverity: ("Unused value")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We can encounter a short read when we're doing buffered reads and the
data is partially cached. Right now we just return the short read, but
that forces the application to read that CQE, then issue another SQE
to finish the read. That read will not be cached, and hence will result
in an async punt.
It's more efficient to do that async punt from within the kernel, as
that avoids two more round trips to the kernel.
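For illustration only, this is roughly the retry loop an application needs
when short reads are handed back to it. It is a minimal sketch using plain
read(2) rather than io_uring; read_full() is a hypothetical helper name, not
part of any kernel or liburing API.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: keep reading until `len` bytes have arrived, EOF is
 * hit, or an error occurs. Every extra loop iteration is one more trip into
 * the kernel, which is the cost an in-kernel retry avoids.
 * Returns the number of bytes read, or -1 with errno set.
 */
ssize_t read_full(int fd, void *buf, size_t len)
{
	size_t done = 0;

	assert(buf != NULL);
	while (done < len) {
		ssize_t n = read(fd, (char *)buf + done, len - done);

		if (n < 0) {
			if (errno == EINTR)
				continue;	/* interrupted, try again */
			return -1;
		}
		if (n == 0)	/* EOF: report what we have */
			break;
		done += (size_t)n;
	}
	return (ssize_t)done;
}
```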
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently these functions return < 0 on error, and 0 for success.
Change that so that we return < 0 on error, but number of bytes
for success.
Some callers already treat the return value that way, others need a
slight tweak.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Due to a confusion I thought that eth_type_trans() was called by the
network stack whereas it can actually be called by network drivers to
figure out the skb protocol and next packet_type handlers.
In light of the above, it is not safe to store the frame type from the
DSA tagger's .filter callback (first entry point on RX path), since GRO
is yet to be invoked on the received traffic. Hence it is very likely
that the skb->cb will actually get overwritten between eth_type_trans()
and the actual DSA packet_type handler.
Of course, what this patch fixes is the actual overwriting of the
SJA1105_SKB_CB(skb)->type field from the GRO layer, which made all
frames be seen as SJA1105_FRAME_TYPE_NORMAL (0).
Fixes: 227d07a07e ("net: dsa: sja1105: Add support for traffic through standalone ports")
Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
While troubleshooting issues where cloned request limits have been
exceeded, it is often beneficial to know the actual values that
have been breached. Print these values to make the root cause of the
breach easier to identify.
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: John Pittman <jpittman@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This patch prevents the kernel-doc script from complaining about these
function headers when building with W=1.
Cc: Hannes Reinecke <hare@suse.com>
Cc: Keith Busch <keith.busch@intel.com>
Fixes: ed76e329d7 ("blk-mq: abstract out queue map") # v5.0.
Fixes: e42b3867de ("blk-mq-rdma: pass in queue map to blk_mq_rdma_map_queues") # v5.0.
Reviewed-by: Chaitanya Kulkarni <chiatanya.kulkarni@wdc.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This patch prevents the kernel-doc tool from warning about this function
header when building with W=1.
Reviewed-by: Chaitanya Kulkarni <chiatanya.kulkarni@wdc.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This patch prevents the kernel-doc tool from warning about this function
header when building with W=1.
Reviewed-by: Chaitanya Kulkarni <chiatanya.kulkarni@wdc.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The Linux kernel already provides MODULE_SIG and KEXEC_VERIFY_SIG to
make sure loaded kernel modules and kernel images are trusted. This
patch adds a kernel command line option "loadpin.exclude" which
allows excluding specific file types from LoadPin. This is useful
when people want to use different mechanisms to verify modules and
kernel images while still using LoadPin to protect the integrity of
other files the kernel loads.
Signed-off-by: Ke Wu <mikewu@google.com>
Reviewed-by: James Morris <jamorris@linux.microsoft.com>
[kees: fix array size issue reported by Coverity via Colin Ian King]
Signed-off-by: Kees Cook <keescook@chromium.org>
Pull nfsd fix from Bruce Fields:
"This reverts a minor fix which could cause us to treat conflicting NLM
locks as nonconflicting.
We have a proper fix queued up for 5.3. In the meantime, a quick revert
seems best for 5.2 and stable"
* tag 'nfsd-5.2-1' of git://linux-nfs.org/~bfields/linux:
Revert "lockd: Show pid of lockd for remote locks"
Pull cifs fixes from Steve French:
"Four small smb3 fixes, one for stable"
* tag 'v5.2-rc2-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6:
CIFS: cifs_read_allocate_pages: don't iterate through whole page array on ENOMEM
dfs_cache: fix a wrong use of kfree in flush_cache_ent()
fs/cifs/smb2pdu.c: fix buffer free in SMB2_ioctl_free
cifs: fix memory leak of pneg_inbuf on -EOPNOTSUPP ioctl case
It turns out that various triggers use led_blink_setup() from atomic
context, so we can't do a flush_work there. Flush is still needed for
slow LEDs, but we can move it to sysfs code where it is safe.
WARNING: inconsistent lock state
5.2.0-rc1 #1 Tainted: G W
--------------------------------
inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
swapper/1/0 [HC0[0]:SC1[1]:HE1:SE0] takes:
000000006e30541b ((work_completion)(&led_cdev->set_brightness_work)){+.?.}, at: __flush_work+0x3b/0x38a
{SOFTIRQ-ON-W} state was registered at:
lock_acquire+0x146/0x1a1
__flush_work+0x5b/0x38a
flush_work+0xb/0xd
led_blink_setup+0x1e/0xd3
led_blink_set+0x3f/0x44
tpt_trig_timer+0xdb/0x106
ieee80211_mod_tpt_led_trig+0xed/0x112
Fixes: 0db37915d9 ("leds: avoid races with workqueue")
Signed-off-by: Pavel Machek <pavel@ucw.cz>
Tested-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Jacek Anaszewski <jacek.anaszewski@gmail.com>
Currently, we try to report to the shrinker the precise number of
objects (pages) that are available to be reaped at this moment. This
requires searching all objects with allocated pages to see if they
fulfill the search criteria, and this count is performed quite
frequently. (The shrinker tries to free ~128 pages on each invocation,
before which we count all the objects; counting takes longer than
unbinding the objects!) If we take the pragmatic view that with
sufficient desire, all objects are eventually reapable (they become
inactive, or no longer used as framebuffer etc), we can simply return
the count of pinned pages maintained during get_pages/put_pages rather
than walk the lists every time.
The downside is that we may (slightly) over-report the number of
objects/pages we could shrink and so penalize ourselves by shrinking
more than required. This is mitigated by keeping the order in which we
shrink objects such that we avoid penalizing active and frequently used
objects, and if memory is so tight that we need to free them we would
need to anyway.
v2: Only expose shrinkable objects to the shrinker; a small reduction in
not considering stolen and foreign objects.
v3: Restore the tracking from a "backup" copy from before the gem/ split
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Matthew Auld <matthew.auld@intel.com>
Reviewed-by: Matthew Auld <matthew.auld@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20190530203500.26272-2-chris@chris-wilson.co.uk
Currently the purgeable objects, I915_MADV_DONTNEED, are mixed in the
normal bound/unbound lists. Every shrinker pass starts with an attempt
to purge from this set of unneeded objects, which entails us doing a
walk over both lists looking for any candidates. If there are none (and
since we are shrinking we can reasonably assume that the lists are
full!), this becomes a very slow, futile walk.
If we separate out the purgeable objects into their own list, this search
then becomes its own phase that is preferentially handled during shrinking.
Instead the cost becomes that we then need to filter the purgeable list
if we want to distinguish between bound and unbound objects.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Matthew Auld <matthew.william.auld@gmail.com>
Reviewed-by: Matthew Auld <matthew.william.auld@gmail.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20190530203500.26272-1-chris@chris-wilson.co.uk
__netdev_tx_sent_queue() was introduced by:
commit 3e59020abf ("net: bql: add __netdev_tx_sent_queue()")
BQL counters should be updated without flipping/caring about
BQL status, if the current skb has xmit_more set.
Using __netdev_tx_sent_queue() avoids messing with BQL stop
flag, increases performance on GSO workload by keeping
doorbells to the minimum required and also sparing atomic
operations.
Signed-off-by: Erez Alfasi <ereza@mellanox.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
HW does not support push VLAN action in the RX direction (packets
arriving from the wire). The FW works around this limitation by hairpinning
the packet. The hairpin workaround applies only when the push VLAN action
is specified in a termination table, assuring that there are no actions
following the hairpin.
Instantiate termination table for push VLAN actions. Re-use identical
terminating tables for increased HW cache efficiency.
Signed-off-by: Oz Shlomo <ozsh@mellanox.com>
Reviewed-by: Paul Blakey <paulb@mellanox.com>
Reviewed-by: Eli Britstein <elibr@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Add HW offloading support for flows with Geneve encap/decap.
Notes about decap flows with Geneve TLV Options:
- Support offloading of 32-bit options data only
- At any given time, only one combination of class/type parameters
can be offloaded, but the same class/type combination can have
many different flows offloaded with different 32-bit option data
- Options with value of 0 can't be offloaded
Reviewed-by: Oz Shlomo <ozsh@mellanox.com>
Signed-off-by: Yevgeny Kliteynik <kliteyn@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Rearrange tc tunnel code so that it would be easy to add future tunnels:
- Define tc tunnel object with the fields and callbacks that any
tunnel must implement.
- Define tc UDP tunnel object for UDP tunnels, such as VXLAN
- Move each tunnel code (GRE, VXLAN) to its own separate file
- Rewrite tc tunnel implementation in a general way - using only
the objects and their callbacks.
Reviewed-by: Oz Shlomo <ozsh@mellanox.com>
Signed-off-by: Yevgeny Kliteynik <kliteyn@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
In mlx5e encap entry structure, IP tunnel info data structure is copied
by value. This approach worked till now, but it breaks when there are
encapsulation options, such as in case of Geneve.
These options are stored in the structure that is allocated adjacent to
the IP tunnel info struct, and not pointed at by any field in that struct.
Therefore, when copying the struct by value, we lose the address of the
original struct and can't get to the encapsulation options.
Fix the problem by storing the pointer to the tunnel info data instead.
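The failure mode can be sketched in userspace. This is an illustration under
assumptions, not the driver code: struct tun_info, tun_info_opts() and the
two encap structs are hypothetical stand-ins for a header whose options are
found by address arithmetic rather than through a pointer field.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a tunnel info header followed by option bytes. */
struct tun_info {
	unsigned char options_len;
	/* options_len bytes of options live directly after this struct */
};

/* The options are located relative to the struct's own address. */
unsigned char *tun_info_opts(struct tun_info *info)
{
	assert(info != NULL);
	return (unsigned char *)(info + 1);
}

/* Allocate header and options in one adjacent chunk. */
struct tun_info *tun_info_alloc(const unsigned char *opts, unsigned char len)
{
	struct tun_info *info = malloc(sizeof(*info) + len);

	if (!info)
		return NULL;
	info->options_len = len;
	memcpy(tun_info_opts(info), opts, len);
	return info;
}

/* Broken pattern: the copy keeps options_len but none of the option bytes. */
struct encap_by_value {
	struct tun_info info;
};

/* Fixed pattern: keep the original address so the options stay reachable. */
struct encap_by_pointer {
	struct tun_info *info;
};
```

With the by-value copy, calling tun_info_opts() on the embedded struct would
point at whatever follows the copy, not at the original options.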
Reviewed-by: Oz Shlomo <ozsh@mellanox.com>
Signed-off-by: Yevgeny Kliteynik <kliteyn@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Use Geneve TLV Options object to manage the flex parser matching
on the 32-bit options data.
When the first flow with certain class/type values is requested to
be offloaded, create a FW object with a FW command (Geneve TLV Options
general object) and start counting the number of flows using this object.
During this time, any request with different class/type values will
fail to be offloaded.
Once the refcount reaches 0, destroy the TLV options general object,
after which a flow with any class/type parameters can be offloaded.
Geneve TLV Options object is added to core device.
It is currently used to manage Geneve TLV options general
object allocation in FW and its reference counting only.
In the future it will also be used for managing geneve ports
by registering callbacks for ndo_udp_tunnel_add/del.
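The reference-counting rule described above can be sketched as follows. All
names here (struct tlv_opt_state, tlv_opt_get(), tlv_opt_put()) are
hypothetical, the FW commands are only indicated by comments, and locking is
left out for brevity.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical state: at most one class/type pair is active at a time. */
struct tlv_opt_state {
	uint16_t opt_class;
	uint8_t opt_type;
	unsigned int refcount;	/* number of offloaded flows using the object */
};

/* Returns true and takes a reference if the flow may be offloaded. */
bool tlv_opt_get(struct tlv_opt_state *s, uint16_t opt_class, uint8_t opt_type)
{
	if (s->refcount == 0) {
		/* First user: this is where the FW object would be created. */
		s->opt_class = opt_class;
		s->opt_type = opt_type;
		s->refcount = 1;
		return true;
	}
	if (s->opt_class != opt_class || s->opt_type != opt_type)
		return false;	/* a different pair is already in use */
	s->refcount++;
	return true;
}

/* Drops a reference; at zero the FW object would be destroyed. */
void tlv_opt_put(struct tlv_opt_state *s)
{
	assert(s->refcount > 0);
	s->refcount--;
}
```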
Reviewed-by: Oz Shlomo <ozsh@mellanox.com>
Signed-off-by: Yevgeny Kliteynik <kliteyn@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
When filling in flow spec match criteria, to allow previous
modifications of the match criteria, use "|=" rather than "=".
Tunnel options are parsed before the match criteria of the offloaded
flow are being set. If the flow that we're about to offload has
encapsulation options, the flow group might need to match on additional
criteria.
For Geneve, an additional flow group matching parameter should
be used - misc3. The appropriate bit in the match criteria is set
while parsing the tunnel options, so the criteria value shouldn't
be overwritten.
This is a pre-step for supporting Geneve TLV options offload.
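A small sketch of why the operator matters. The bit values and the struct
below are made up for illustration; the real driver defines its own match
criteria bits.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical bit positions, not the driver's actual values. */
enum {
	MATCH_OUTER_HEADERS = 1u << 0,
	MATCH_MISC          = 1u << 1,
	MATCH_MISC3         = 1u << 4,
};

struct flow_spec {
	uint8_t match_criteria_enable;
};

/* Runs first: tunnel option parsing requests misc3 matching. */
void parse_tunnel_opts(struct flow_spec *spec)
{
	spec->match_criteria_enable |= MATCH_MISC3;
}

/* Runs later, old behaviour: "=" discards the misc3 bit set above. */
void set_match_overwrite(struct flow_spec *spec)
{
	spec->match_criteria_enable = MATCH_OUTER_HEADERS;
}

/* Runs later, new behaviour: "|=" keeps earlier modifications. */
void set_match_accumulate(struct flow_spec *spec)
{
	spec->match_criteria_enable |= MATCH_OUTER_HEADERS;
}
```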
Reviewed-by: Oz Shlomo <ozsh@mellanox.com>
Signed-off-by: Yevgeny Kliteynik <kliteyn@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
In some cases we don't care about enc_src_ip and enc_dst_ip. If we
don't match on these fields, we can use fewer flows in hardware when
receiving tunnel packets. For example, when the tunnel packets may be
sent from different hosts, we currently must offload one rule for
each host:
$ tc filter add dev vxlan0 protocol ip parent ffff: prio 1 \
flower dst_mac 00:11:22:33:44:00 \
enc_src_ip Host0_IP enc_dst_ip 2.2.2.100 \
enc_dst_port 4789 enc_key_id 100 \
action tunnel_key unset action mirred egress redirect dev eth0_1
$ tc filter add dev vxlan0 protocol ip parent ffff: prio 1 \
flower dst_mac 00:11:22:33:44:00 \
enc_src_ip Host1_IP enc_dst_ip 2.2.2.100 \
enc_dst_port 4789 enc_key_id 100 \
action tunnel_key unset action mirred egress redirect dev eth0_1
If we support flows which only match the enc_key_id and enc_dst_port,
a single flow can process the packets sent to the VM (MAC 00:11:22:33:44:00).
$ tc filter add dev vxlan0 protocol ip parent ffff: prio 1 \
flower dst_mac 00:11:22:33:44:00 \
enc_dst_port 4789 enc_key_id 100 \
action tunnel_key unset action mirred egress redirect dev eth0_1
Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
Reviewed-by: Roi Dayan <roid@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Besides the special vports (PF/uplink/ecpf), the rest of the vports
are similar.
Remove vf_ prefix from function and variable names.
This patch does not change any functionality.
Signed-off-by: Vu Pham <vuhuong@mellanox.com>
Reviewed-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Bodong Wang <bodong@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
This series provides some low level updates for mlx5 driver needed for
both rdma and netdev trees.
1) Termination flow steering table bits and hardware definitions.
2) Introduce the core dump HW access registers definitions.
3) Refactor and clean up VF representor function handlers.
4) Rename host_params bits to function_changed bits and add
support for the eswitch functions change event in the general eswitch case
(for both legacy and switchdev modes).
5) Fix a potential error pointer dereference in error handling.
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Russell King says:
====================
phylink/sfp updates
This is a series of updates to phylink and sfp:
- Remove an unused net device argument from the phylink MII ioctl
emulation code.
- add support for using interrupts when using a GPIO for link status
tracking, rather than polling it at one second intervals. This
reduces the need to wake up the CPU every second.
- add support to the MII ioctl API to read and write Clause 45 PHY
registers. I don't know how desirable this is for mainline, but I
have used this facility extensively to investigate the Marvell
88x3310 PHY. A recent illustration of use for this was debugging
the PHY-without-firmware problem recently reported.
- add mandatory attach/detach methods for the upstream side of sfp
bus code, which will allow us to remove the "netdev" structure from
the SFP layers.
- remove the "netdev" structure from the SFP upstream registration
calls, which simplifies PHY to SFP links.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
The sfp-bus code now no longer has any use for the network device
structure, so remove its use.
Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add attach and detach methods for SFP buses, which will allow us to get
rid of the netdev storage in sfp-bus.
Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Allow userspace to generate Clause 45 MII access cycles via phylib.
This is useful for tools such as mii-diag to be able to inspect Clause
45 PHYs.
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add support for using GPIO interrupts with a fixed-link GPIO rather than
polling the GPIO every second and invoking the phylink resolution. This
avoids unnecessary calls to mac_config().
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, for every representor type and for every single vport, a
copy of the representor function pointers is stored even though they
don't change from one vport to another.
Additionally, the priv data entry for the rep is not passed during
registration, but it is copied. It is used (set and cleared) by the user
of the reps.
As we want to scale vports, to simplify and also to split constants
from data,
1. Rename mlx5_eswitch_rep_if to mlx5_eswitch_rep_ops to match the _ops
suffix used by other standard netdev and ibdev ops.
2. Constify the IB and Ethernet rep ops structure.
3. Instead of storing a copy of all rep function pointers per vport,
store one copy per eswitch rep type.
4. Split data and function pointers to mlx5_eswitch_rep_ops and
mlx5_eswitch_rep_data.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Avoid typecasting from void* to mlx5_ib_dev* or mlx5e_rep_priv*
as it is not needed.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Whenever a device supports the eswitch functions changed event, honor
that device setting. Do not limit it to ECPF.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Vu Pham <vuhuong@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
To support sriov on an E-Switch manager, num_vfs is queried
from the firmware whenever the E-Switch manager is notified by an
esw_functions_changed event.
Replace host_params event with esw_functions_changed event that reflects
more appropriate naming.
While at it, also correct num_vfs type from int to u16 as expected by
the function mlx5_esw_query_functions().
Signed-off-by: Vu Pham <vuhuong@mellanox.com>
Reviewed-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Bodong Wang <bodong@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Termination table is a flow table with a termination flag. The flag
allows the firmware to assume that the specified actions are the last
actions list. This assumption allows the FW to safely perform potential
looping logic (e.g. hairpin). Introduce the bits for this attribute.
Signed-off-by: Eli Britstein <elibr@mellanox.com>
Reviewed-by: Oz Shlomo <ozsh@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
When the user submits more than 32 work requests to a srq queue
at a time, it needs to find the corresponding number of entries
in the bitmap in the idx queue. However, the original lookup
function, ffs, only processes 32 bits of the array element.
When the number of srq wqes issued exceeds 32, ffs only
processes the lower 32 bits of the element, so it cannot
return the correct wqe index for the srq wqe.
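The truncation can be reproduced in userspace. This is a sketch, not the
driver change: first_set_32() and first_set_64() are hypothetical names, and
the 64-bit variant uses the GCC/Clang builtin __builtin_ffsll() as one
possible replacement; the message above does not say which function the
driver switched to.

```c
#include <assert.h>
#include <stdint.h>
#include <strings.h>	/* ffs() */

/* Mirrors the bug: ffs() takes an int, so bits 32..63 are never examined. */
int first_set_32(uint64_t word)
{
	return ffs((int)(uint32_t)word);
}

/* One possible fix: scan all 64 bits of the bitmap element. */
int first_set_64(uint64_t word)
{
	return __builtin_ffsll((long long)word);
}
```

Both helpers return a 1-based bit position, or 0 when no bit is found. For
an element with only bit 40 set, the 32-bit lookup reports 0 while the
64-bit lookup reports 41.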
Signed-off-by: Xi Wang <wangxi11@huawei.com>
Signed-off-by: Lijun Ou <oulijun@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
cgroup already uses floating point for percent[ile] numbers and there
are several controllers which want to take them as input. Add a
generic parse helper to handle inputs.
Update the interface convention documentation about the use of
percentage numbers. While at it, also clarify the default time unit.
Signed-off-by: Tejun Heo <tj@kernel.org>
Pull integrity subsystem fixes from Mimi Zohar:
"Four bug fixes, none 5.2-specific, all marked for stable.
The first two are related to the architecture specific IMA policy
support. The other two patches, one is related to EVM signatures,
based on additional hash algorithms, and the other is related to
displaying the IMA policy"
* 'next-fixes-for-5.2-rc' of git://git.kernel.org/pub/scm/linux/kernel/git/zohar/linux-integrity:
ima: show rules with IMA_INMASK correctly
evm: check hash algorithm passed to init_desc()
ima: fix wrong signed policy requirement when not appraising
x86/ima: Check EFI_RUNTIME_SERVICES before using
Pull xen fixes from Juergen Gross:
"One minor cleanup patch and a fix for handling of live migration when
running as Xen guest"
* tag 'for-linus-5.2b-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
xenbus: Avoid deadlock during suspend due to open transactions
xen/pvcalls: Remove set but not used variable
Since a283348629 ("page cache: Finish XArray conversion"), on most
major Linux distributions, the page cache doesn't correctly transition
when the hot data set is changing, and leaves the new pages thrashing
indefinitely instead of kicking out the cold ones.
On a freshly booted, freshly ssh'd into virtual machine with 1G RAM
running stock Arch Linux:
[root@ham ~]# ./reclaimtest.sh
+ dd of=workingset-a bs=1M count=0 seek=600
+ cat workingset-a
+ cat workingset-a
+ cat workingset-a
+ cat workingset-a
+ cat workingset-a
+ cat workingset-a
+ cat workingset-a
+ cat workingset-a
+ ./mincore workingset-a
153600/153600 workingset-a
+ dd of=workingset-b bs=1M count=0 seek=600
+ cat workingset-b
+ cat workingset-b
+ cat workingset-b
+ cat workingset-b
+ ./mincore workingset-a workingset-b
104029/153600 workingset-a
120086/153600 workingset-b
+ cat workingset-b
+ cat workingset-b
+ cat workingset-b
+ cat workingset-b
+ ./mincore workingset-a workingset-b
104029/153600 workingset-a
120268/153600 workingset-b
workingset-b is a 600M file on a 1G host that is otherwise entirely
idle. No matter how often it's being accessed, it won't get cached.
While investigating, I noticed that the non-resident information gets
aggressively reclaimed - /proc/vmstat::workingset_nodereclaim. This is
a problem because a workingset transition like this relies on the
non-resident information tracked in the page cache tree of evicted
file ranges: when the cache faults are refaults of recently evicted
cache, we challenge the existing active set, and that allows a new
workingset to establish itself.
Tracing the shrinker that maintains this memory revealed that all page
cache tree nodes were allocated to the root cgroup. This is a problem,
because 1) the shrinker sizes the amount of non-resident information
it keeps to the size of the cgroup's other memory and 2) on most major
Linux distributions, only kernel threads live in the root cgroup and
everything else gets put into services or session groups:
[root@ham ~]# cat /proc/self/cgroup
0::/user.slice/user-0.slice/session-c1.scope
As a result, we basically maintain no non-resident information for the
workloads running on the system, thus breaking the caching algorithm.
Looking through the code, I found the culprit in the above-mentioned
patch: when switching from the radix tree to xarray, it dropped the
__GFP_ACCOUNT flag from the tree node allocations - the flag that
makes sure the allocated memory gets charged to and tracked by the
cgroup of the calling process - in this case, the one doing the fault.
To fix this, allow xarray users to specify a per-tree flag that makes
xarray allocate nodes using __GFP_ACCOUNT. Then restore the page cache
tree annotation to request such cgroup tracking for the cache nodes.
With this patch applied, the page cache correctly converges on new
workingsets again after just a few iterations:
[root@ham ~]# ./reclaimtest.sh
+ dd of=workingset-a bs=1M count=0 seek=600
+ cat workingset-a
+ cat workingset-a
+ cat workingset-a
+ cat workingset-a
+ cat workingset-a
+ cat workingset-a
+ cat workingset-a
+ cat workingset-a
+ ./mincore workingset-a
153600/153600 workingset-a
+ dd of=workingset-b bs=1M count=0 seek=600
+ cat workingset-b
+ ./mincore workingset-a workingset-b
124607/153600 workingset-a
87876/153600 workingset-b
+ cat workingset-b
+ ./mincore workingset-a workingset-b
81313/153600 workingset-a
133321/153600 workingset-b
+ cat workingset-b
+ ./mincore workingset-a workingset-b
63036/153600 workingset-a
153600/153600 workingset-b
Cc: stable@vger.kernel.org # 4.20+
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
The phylink conflict was between a bug fix by Russell King
to make sure we have a consistent PHY interface mode, and
a change in net-next to pull some code in phylink_resolve()
into the helper functions phylink_mac_link_{up,down}()
On the dp83867 side it's mostly overlapping changes, with
the 'net' side removing a condition that was supposed to
trigger for RGMII but because of how it was coded never
actually could trigger.
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull s390 fixes from Heiko Carstens:
- Farewell Martin Schwidefsky: add Martin to CREDITS and remove him
from MAINTAINERS
- Vasily Gorbik and Christian Borntraeger join as maintainers for s390
- Fix locking bug in ctr(aes) and ctr(des) s390 specific ciphers
- A rather large patch which fixes gcm-aes-s390 scatter gather handling
- Fix zcrypt wrong dispatching for control domain CPRBs
- Fix assignment of bus resources in PCI code
- Fix structure definition for set PCI function
- Fix one compile error and one compile warning seen when
CONFIG_OPTIMIZE_INLINING is enabled
* tag 's390-5.2-3' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
MAINTAINERS: add Vasily Gorbik and Christian Borntraeger for s390
MAINTAINERS: Farewell Martin Schwidefsky
s390/crypto: fix possible sleep during spinlock aquired
s390/crypto: fix gcm-aes-s390 selftest failures
s390/zcrypt: Fix wrong dispatching for control domain CPRBs
s390/pci: fix assignment of bus resources
s390/pci: fix struct definition for set PCI function
s390: mark __cpacf_check_opcode() and cpacf_query_func() as __always_inline
s390: add unreachable() to dump_fault_info() to fix -Wmaybe-uninitialized
CSS_TASK_ITER_PROCS currently iterates live group leaders; however,
this means that a process with dying leader and live threads will be
skipped. IOW, cgroup.procs might be empty while cgroup.threads isn't,
which is confusing to say the least.
Fix it by making cset track dying tasks and include dying leaders with
live threads in PROCS iteration.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-and-tested-by: Topi Miettinen <toiwoton@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
When a task is moved out of a cset, task iterators pointing to the
task are advanced using the normal css_task_iter_advance() call. This
is fine but we'll be tracking dying tasks on csets and thus moving
tasks from cset->tasks to (to be added) cset->dying_tasks. When we
remove a task from cset->tasks, if we advance the iterators, they may
move over to the next cset before we had the chance to add the task
back on the dying list, which can allow the task to escape iteration.
This patch separates out skipping from advancing. Skipping only moves
the affected iterators to the next pointer rather than fully advancing
it and the following advancing will recognize that the cursor has
already been moved forward and do the rest of advancing. This ensures
that when a task moves from one list to another in its cset, as long
as it moves in the right direction, it's always visible to iteration.
This doesn't cause any visible behavior changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>