Commit Graph

96271 Commits

Author SHA1 Message Date
Yuchung Cheng
737ff31456 tcp: use sequence distance to detect reordering
Replace the reordering distance measurement in packet unit with
sequence based approach. Previously it trackes the number of "packets"
toward the forward ACK (i.e.  highest sacked sequence)in a state
variable "fackets_out".

Precisely measuring reordering degree on packet distance has not much
benefit, as the degree constantly changes by factors like path, load,
and congestion window. It is also complicated and prone to arcane bugs.
This patch replaces with sequence-based approach that's much simpler.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: Priyaranjan Jha <priyarjha@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-11 18:53:16 +09:00
Yuchung Cheng
713bafea92 tcp: retire FACK loss detection
FACK loss detection has been disabled by default and the
successor RACK subsumed FACK and can handle reordering better.
This patch removes FACK to simplify TCP loss recovery.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: Priyaranjan Jha <priyarjha@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-11 18:53:16 +09:00
David S. Miller
f3edacbd69 bpf: Revert bpf_overrid_function() helper changes.
NACK'd by x86 maintainer.

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-11 18:24:55 +09:00
Jon Maloy
8d6e79d3ce tipc: improve link resiliency when rps is activated
Currently, the TIPC RPS dissector is based only on the incoming packets'
source node address, hence steering all traffic from a node to the same
core. We have seen that this makes the links vulnerable to starvation
and unnecessary resets when we turn down the link tolerance to very low
values.

To reduce the risk of this happening, we exempt probe and probe replies
packets from the convergence to one core per source node. Instead, we do
the opposite, - we try to diverge those packets across as many cores as
possible, by randomizing the flow selector key.

To make such packets identifiable to the dissector, we add a new
'is_keepalive' bit to word 0 of the LINK_PROTOCOL header. This bit is
set both for PROBE and PROBE_REPLY messages, and only for those.

It should be noted that these packets are not part of any flow anyway,
and only constitute a minuscule fraction of all packets sent across a
link. Hence, there is no risk that this will affect overall performance.

Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-11 15:36:05 +09:00
Maciej Żenczykowski
2210d6b2f2 net: ipv6: sysctl to specify IPv6 ND traffic class
Add a per-device sysctl to specify the default traffic class to use for
kernel originated IPv6 Neighbour Discovery packets.

Currently this includes:

  - Router Solicitation (ICMPv6 type 133)
    ndisc_send_rs() -> ndisc_send_skb() -> ip6_nd_hdr()

  - Neighbour Solicitation (ICMPv6 type 135)
    ndisc_send_ns() -> ndisc_send_skb() -> ip6_nd_hdr()

  - Neighbour Advertisement (ICMPv6 type 136)
    ndisc_send_na() -> ndisc_send_skb() -> ip6_nd_hdr()

  - Redirect (ICMPv6 type 137)
    ndisc_send_redirect() -> ndisc_send_skb() -> ip6_nd_hdr()

and if the kernel ever gets around to generating RA's,
it would presumably also include:

  - Router Advertisement (ICMPv6 type 134)
    (radvd daemon could pick up on the kernel setting and use it)

Interface drivers may examine the Traffic Class value and translate
the DiffServ Code Point into a link-layer appropriate traffic
prioritization scheme.  An example of mapping IETF DSCP values to
IEEE 802.11 User Priority values can be found here:

    https://tools.ietf.org/html/draft-ietf-tsvwg-ieee-802-11

The expected primary use case is to properly prioritize ND over wifi.

Testing:
  jzem22:~# cat /proc/sys/net/ipv6/conf/eth0/ndisc_tclass
  0
  jzem22:~# echo -1 > /proc/sys/net/ipv6/conf/eth0/ndisc_tclass
  -bash: echo: write error: Invalid argument
  jzem22:~# echo 256 > /proc/sys/net/ipv6/conf/eth0/ndisc_tclass
  -bash: echo: write error: Invalid argument
  jzem22:~# echo 0 > /proc/sys/net/ipv6/conf/eth0/ndisc_tclass
  jzem22:~# echo 255 > /proc/sys/net/ipv6/conf/eth0/ndisc_tclass
  jzem22:~# cat /proc/sys/net/ipv6/conf/eth0/ndisc_tclass
  255
  jzem22:~# echo 34 > /proc/sys/net/ipv6/conf/eth0/ndisc_tclass
  jzem22:~# cat /proc/sys/net/ipv6/conf/eth0/ndisc_tclass
  34

  jzem22:~# echo $[0xDC] > /proc/sys/net/ipv6/conf/eth0/ndisc_tclass
  jzem22:~# tcpdump -v -i eth0 icmp6 and src host jzem22.pgc and dst host fe80::1
  tcpdump: listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
  IP6 (class 0xdc, hlim 255, next-header ICMPv6 (58) payload length: 24)
  jzem22.pgc > fe80::1: [icmp6 sum ok] ICMP6, neighbor advertisement,
  length 24, tgt is jzem22.pgc, Flags [solicited]

(based on original change written by Erik Kline, with minor changes)

v2: fix 'suspicious rcu_dereference_check() usage'
    by explicitly grabbing the rcu_read_lock.

Cc: Lorenzo Colitti <lorenzo@google.com>
Signed-off-by: Erik Kline <ek@google.com>
Signed-off-by: Maciej Żenczykowski <maze@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-11 15:13:02 +09:00
Josef Bacik
dd0bb688ea bpf: add a bpf_override_function helper
Error injection is sloppy and very ad-hoc.  BPF could fill this niche
perfectly with it's kprobe functionality.  We could make sure errors are
only triggered in specific call chains that we care about with very
specific situations.  Accomplish this with the bpf_override_funciton
helper.  This will modify the probe'd callers return value to the
specified value and set the PC to an override function that simply
returns, bypassing the originally probed function.  This gives us a nice
clean way to implement systematic error injection for all of our code
paths.

Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-11 12:18:05 +09:00
Girish Moodalbail
54985120a1 net: fix incorrect comment with regard to VLAN packet handling
The commit bcc6d47903 ("net: vlan: make non-hw-accel rx path similar
to hw-accel") unified accel and non-accel path for VLAN RX. With that
fix we do not register any packet_type handler for VLANs anymore, so fix
the incorrect comment.

Signed-off-by: Girish Moodalbail <girish.moodalbail@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-10 18:04:21 +09:00
Manish Kurup
4c5b9d9642 act_vlan: VLAN action rewrite to use RCU lock/unlock and update
Using a spinlock in the VLAN action causes performance issues when the VLAN
action is used on multiple cores. Rewrote the VLAN action to use RCU read
locking for reads and updates instead.
All functions now use an RCU dereferenced pointer to access the VLAN action
context. Modified helper functions used by other modules, to use the RCU as
opposed to directly accessing the structure.

Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Manish Kurup <manish.kurup@verizon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-10 15:32:20 +09:00
Eric Dumazet
356d1833b6 tcp: Namespace-ify sysctl_tcp_rmem and sysctl_tcp_wmem
Note that when a new netns is created, it inherits its
sysctl_tcp_rmem and sysctl_tcp_wmem from initial netns.

This change is needed so that we can refine TCP rcvbuf autotuning,
to take RTT into consideration.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Wei Wang <weiwan@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-10 14:34:58 +09:00
Eric Dumazet
a3dcaf17ee net: allow per netns sysctl_rmem and sysctl_wmem for protos
As we want to gradually implement per netns sysctl_rmem and sysctl_wmem
on per protocol basis, add two new fields in struct proto,
and two new helpers : sk_get_wmem0() and sk_get_rmem0()

First user will be TCP. Then UDP and SCTP can be easily converted,
while DECNET probably wont get this support.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-10 14:34:58 +09:00
David S. Miller
4fdc3023c6 Merge tag 'mlx5-updates-2017-11-09' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux
Saeed Mahameed says:

====================
mlx5-updates-2017-11-09

This series introduces vlan offloads related improvements for mlx5
ethernet netdev driver, from Gal Pressman.

 - Add support for 802.1ad vlan filter
 - Add support for 802.1ad vlan insertion
 - Add vlan offloads statistics to ethtool (inserted/stripped vlans)
 - CHECKSUM_COMPLETE support for vlan traffic when vlan stripping is off! (Finally)
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-10 13:44:46 +09:00
Andrew Lunn
47d5b6db2a net: bridge: Add/del switchdev object on host join/leave
When the host joins or leaves a multicast group, use switchdev to add
an object to the hardware to forward traffic for the group to the
host.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-10 13:41:40 +09:00
David S. Miller
4dc6758d78 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Simple cases of overlapping changes in the packet scheduler.

Must easier to resolve this time.

Which probably means that I screwed it up somehow.

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-10 10:00:18 +09:00
Linus Torvalds
d93d4ce103 Merge tag 'sound-4.14' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound
Pull sound fixes from Takashi Iwai:
 "The amount of the changes isn't as quite small as wished, nevertheless
  they are straight fixes that deserve merging to 4.14 final.

  Most of fixes are about ALSA core bugs spotted by fuzzer: a follow-up
  fix for the previous nested rwsem patch, a fix to avoid the resource
  hogs due to too many concurrent ALSA timer invocations, and a fix for
  a crash with SYSEX MIDI transfer over OSS sequencer emulation that is
  used by none but fuzzer.

  The rest are usual HD-audio and USB-audio device-specific quirks,
  which are safe to apply"

* tag 'sound-4.14' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
  ALSA: hda - fix headset mic problem for Dell machines with alc274
  ALSA: seq: Fix OSS sysex delivery in OSS emulation
  ALSA: seq: Avoid invalid lockdep class warning
  ALSA: timer: Limit max instances per timer
  ALSA: usb-audio: support new Amanero Combo384 firmware version
2017-11-09 09:58:11 -08:00
Linus Torvalds
d1041cdc60 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) Fix use-after-free in IPSEC input parsing, desintation address
    pointer was loaded before pskb_may_pull() which can change the SKB
    data pointers. From Florian Westphal.

 2) Stack out-of-bounds read in xfrm_state_find(), from Steffen
    Klassert.

 3) IPVS state of SKB is not properly reset when moving between
    namespaces, from Ye Yin.

 4) Fix crash in asix driver suspend and resume, from Andrey Konovalov.

 5) Don't deliver ipv6 l2tp tunnel packets to ipv4 l2tp tunnels, and
    vice versa, from Guillaume Nault.

 6) Fix DSACK undo on non-dup ACKs, from Priyaranjan Jha.

 7) Fix regression in bond_xmit_hash()'s behavior after the TCP port
    selection changes back in 4.2, from Hangbin Liu.

 8) Two divide by zero bugs in USB networking drivers when parsing
    descriptors, from Bjorn Mork.

 9) Fix bonding slaves being stuck in BOND_LINK_FAIL state, from Jay
    Vosburgh.

10) Missing skb_reset_mac_header() in qmi_wwan, from Kristian Evensen.

11) Fix the destruction of tc action object races properly, from Cong
    Wang.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (31 commits)
  cls_u32: use tcf_exts_get_net() before call_rcu()
  cls_tcindex: use tcf_exts_get_net() before call_rcu()
  cls_rsvp: use tcf_exts_get_net() before call_rcu()
  cls_route: use tcf_exts_get_net() before call_rcu()
  cls_matchall: use tcf_exts_get_net() before call_rcu()
  cls_fw: use tcf_exts_get_net() before call_rcu()
  cls_flower: use tcf_exts_get_net() before call_rcu()
  cls_flow: use tcf_exts_get_net() before call_rcu()
  cls_cgroup: use tcf_exts_get_net() before call_rcu()
  cls_bpf: use tcf_exts_get_net() before call_rcu()
  cls_basic: use tcf_exts_get_net() before call_rcu()
  net_sched: introduce tcf_exts_get_net() and tcf_exts_put_net()
  Revert "net_sched: hold netns refcnt for each action"
  net: usb: asix: fill null-ptr-deref in asix_suspend
  Revert "net: usb: asix: fill null-ptr-deref in asix_suspend"
  qmi_wwan: Add missing skb_reset_mac_header-call
  bonding: fix slave stuck in BOND_LINK_FAIL state
  qrtr: Move to postcore_initcall
  net: qmi_wwan: fix divide by 0 on bad descriptors
  net: cdc_ether: fix divide by 0 on bad descriptors
  ...
2017-11-09 09:31:34 -08:00
Gal Pressman
4382c7b92a net/mlx5e: Add 802.1ad VLAN insertion support
Report VLAN insertion support for S-tagged packets and add support by
choosing the correct VLAN type in the WQE.

Signed-off-by: Gal Pressman <galp@mellanox.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2017-11-09 13:27:35 +09:00
Gal Pressman
375ef2b1f0 net: Introduce netdev_*_once functions
Extend the net device error logging with netdev_*_once macros.
netdev_*_once are the equivalents of the dev_*_once macros which are
useful for messages that should only be logged once.

Also add netdev_WARN_ONCE, which is the "once" extension for the already
existing netdev_WARN macro.

Signed-off-by: Gal Pressman <galp@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2017-11-09 13:27:01 +09:00
Cong Wang
e4b95c41df net_sched: introduce tcf_exts_get_net() and tcf_exts_put_net()
Instead of holding netns refcnt in tc actions, we can minimize
the holding time by saving it in struct tcf_exts instead. This
means we can just hold netns refcnt right before call_rcu() and
release it after tcf_exts_destroy() is done.

However, because on netns cleanup path we call tcf_proto_destroy()
too, obviously we can not hold netns for a zero refcnt, in this
case we have to do cleanup synchronously. It is fine for RCU too,
the caller cleanup_net() already waits for a grace period.

For other cases, refcnt is non-zero and we can safely grab it as
normal and release it after we are done.

This patch provides two new API for each filter to use:
tcf_exts_get_net() and tcf_exts_put_net(). And all filters now can
use the following pattern:

void __destroy_filter() {
  tcf_exts_destroy();
  tcf_exts_put_net();  // <== release netns refcnt
  kfree();
}
void some_work() {
  rtnl_lock();
  __destroy_filter();
  rtnl_unlock();
}
void some_rcu_callback() {
  tcf_queue_work(some_work);
}

if (tcf_exts_get_net())  // <== hold netns refcnt
  call_rcu(some_rcu_callback);
else
  __destroy_filter();

Cc: Lucas Bates <lucasb@mojatatu.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-09 10:03:09 +09:00
Cong Wang
c7e460ce55 Revert "net_sched: hold netns refcnt for each action"
This reverts commit ceffcc5e25.
If we hold that refcnt, the netns can never be destroyed until
all actions are destroyed by user, this breaks our netns design
which we expect all actions are destroyed when we destroy the
whole netns.

Cc: Lucas Bates <lucasb@mojatatu.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-09 10:03:09 +09:00
Vivien Didelot
ec15dd4269 net: dsa: setup and teardown tree
This commit provides better scope for the DSA tree setup and teardown
functions. It renames the "applied" bool to "setup" and print a message
when the tree is setup, as it is done during teardown.

At the same time, check dst->setup in dsa_tree_setup, where it is set to
true.

Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-09 09:26:49 +09:00
Vivien Didelot
24a9332a58 net: dsa: constify cpu_dp member of dsa_port
A DSA port has a dedicated CPU port assigned to it, stored in the cpu_dp
member. It is not meant to be modified by a port, thus make it const.

Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-09 09:26:49 +09:00
Yi Yang
b2d0f5d5dc openvswitch: enable NSH support
v16->17
 - Fixed disputed check code: keep them in nsh_push and nsh_pop
   but also add them in __ovs_nla_copy_actions

v15->v16
 - Add csum recalculation for nsh_push, nsh_pop and set_nsh
   pointed out by Pravin
 - Move nsh key into the union with ipv4 and ipv6 and add
   check for nsh key in match_validate pointed out by Pravin
 - Add nsh check in validate_set and __ovs_nla_copy_actions

v14->v15
 - Check size in nsh_hdr_from_nlattr
 - Fixed four small issues pointed out By Jiri and Eric

v13->v14
 - Rename skb_push_nsh to nsh_push per Dave's comment
 - Rename skb_pop_nsh to nsh_pop per Dave's comment

v12->v13
 - Fix NSH header length check in set_nsh

v11->v12
 - Fix missing changes old comments pointed out
 - Fix new comments for v11

v10->v11
 - Fix the left three disputable comments for v9
   but not fixed in v10.

v9->v10
 - Change struct ovs_key_nsh to
       struct ovs_nsh_key_base base;
       __be32 context[NSH_MD1_CONTEXT_SIZE];
 - Fix new comments for v9

v8->v9
 - Fix build error reported by daily intel build
   because nsh module isn't selected by openvswitch

v7->v8
 - Rework nested value and mask for OVS_KEY_ATTR_NSH
 - Change pop_nsh to adapt to nsh kernel module
 - Fix many issues per comments from Jiri Benc

v6->v7
 - Remove NSH GSO patches in v6 because Jiri Benc
   reworked it as another patch series and they have
   been merged.
 - Change it to adapt to nsh kernel module added by NSH
   GSO patch series

v5->v6
 - Fix the rest comments for v4.
 - Add NSH GSO support for VxLAN-gpe + NSH and
   Eth + NSH.

v4->v5
 - Fix many comments by Jiri Benc and Eric Garver
   for v4.

v3->v4
 - Add new NSH match field ttl
 - Update NSH header to the latest format
   which will be final format and won't change
   per its author's confirmation.
 - Fix comments for v3.

v2->v3
 - Change OVS_KEY_ATTR_NSH to nested key to handle
   length-fixed attributes and length-variable
   attriubte more flexibly.
 - Remove struct ovs_action_push_nsh completely
 - Add code to handle nested attribute for SET_MASKED
 - Change PUSH_NSH to use the nested OVS_KEY_ATTR_NSH
   to transfer NSH header data.
 - Fix comments and coding style issues by Jiri and Eric

v1->v2
 - Change encap_nsh and decap_nsh to push_nsh and pop_nsh
 - Dynamically allocate struct ovs_action_push_nsh for
   length-variable metadata.

OVS master and 2.8 branch has merged NSH userspace
patch series, this patch is to enable NSH support
in kernel data path in order that OVS can support
NSH in compat mode by porting this.

Signed-off-by: Yi Yang <yi.y.yang@intel.com>
Acked-by: Jiri Benc <jbenc@redhat.com>
Acked-by: Eric Garver <e@erig.me>
Acked-by: Pravin Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-08 16:12:33 +09:00
David S. Miller
2eb3ed33e5 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next
Pablo Neira Ayuso says:

====================
Netfilter/IPVS updates for net-next

The following patchset contains Netfilter/IPVS updates for your net-next
tree, they are:

1) Speed up table replacement on busy systems with large tables
   (and many cores) in x_tables. Now xt_replace_table() synchronizes by
   itself by waiting until all cpus had an even seqcount and we use no
   use seqlock when fetching old counters, from Florian Westphal.

2) Add nf_l4proto_log_invalid() and nf_ct_l4proto_log_invalid() to speed
   up packet processing in the fast path when logging is not enabled, from
   Florian Westphal.

3) Precompute masked address from configuration plane in xt_connlimit,
   from Florian.

4) Don't use explicit size for set selection if performance set policy
   is selected.

5) Allow to get elements from an existing set in nf_tables.

6) Fix incorrect check in nft_hash_deactivate(), from Florian.

7) Cache netlink attribute size result in l4proto->nla_size, from
   Florian.

8) Handle NFPROTO_INET in nf_ct_netns_get() from conntrack core.

9) Use power efficient workqueue in conntrack garbage collector, from
   Vincent Guittot.

10) Remove unnecessary parameter, in conntrack l4proto functions, also
    from Florian.

11) Constify struct nf_conntrack_l3proto definitions, from Florian.

12) Remove all typedefs in nf_conntrack_h323 via coccinelle semantic
    patch, from Harsha Sharma.

13) Don't store address in the rbtree nodes in xt_connlimit, they are
    never used, from Florian.

14) Fix out of bound access in the conntrack h323 helper, patch from
    Eric Sesterhenn.

15) Print symbols for the address returned with %pS in IPVS, from
    Helge Deller.

16) Proc output should only display its own netns in IPVS, from
    KUWAZAWA Takuya.

17) Small clean up in size_entry_mwt(), from Colin Ian King.

18) Use test_and_clear_bit from nf_nat_proto_clean() instead of separated
    non-atomic test and then clear bit, from Florian Westphal.

19) Consolidate prefix length maps in ipset, from Aaron Conole.

20) Fix sparse warnings in ipset, from Jozsef Kadlecsik.

21) Simplify list_set_memsize(), from simran singhal.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-08 14:22:50 +09:00
Arnd Bergmann
96c623e51f of: add of_property_read_variable_* dummy helpers
Commit a67e9472da ("of: Add array read functions with min/max size
limits") added a new interface for reading variable-length arrays from
DT properties. One user was added in dsa recently and this causes a
build error because that code can be built with CONFIG_OF disabled:

net/dsa/dsa2.c: In function 'dsa_switch_parse_member_of':
net/dsa/dsa2.c:678:7: error: implicit declaration of function 'of_property_read_variable_u32_array'; did you mean 'of_property_read_u32_array'? [-Werror=implicit-function-declaration]

This adds a dummy functions for of_property_read_variable_u32_array()
and a few others that had been missing here. I decided to move
of_property_read_string() and of_property_read_string_helper() in the
process to make it easier to compare the two sets of function prototypes
to make sure they match.

Fixes: 975e6e3221 ("net: dsa: rework switch parsing")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Rob Herring <robh@kernel.org>
Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-08 13:31:47 +09:00
Egil Hjelmeland
620a5c860b net: dsa: lan9303: Correct register names in comments
Two comments refer to registers, but lack the LAN9303_ prefix.
Fix that.

Signed-off-by: Egil Hjelmeland <privat@egil-hjelmeland.no>
Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-08 13:29:05 +09:00
Nogah Frankel
8521db4c7e net_sch: cbs: Change TC_SETUP_CBS to TC_SETUP_QDISC_CBS
Change TC_SETUP_CBS to TC_SETUP_QDISC_CBS to match the new convention..

Signed-off-by: Nogah Frankel <nogahf@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Acked-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-08 12:23:38 +09:00
Nogah Frankel
575ed7d39e net_sch: mqprio: Change TC_SETUP_MQPRIO to TC_SETUP_QDISC_MQPRIO
Change TC_SETUP_MQPRIO to TC_SETUP_QDISC_MQPRIO to match the new
convention.

Signed-off-by: Nogah Frankel <nogahf@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-08 12:23:38 +09:00
Nogah Frankel
602f3baf22 net_sch: red: Add offload ability to RED qdisc
Add the ability to offload RED qdisc by using ndo_setup_tc.
There are four commands for RED offloading:
* TC_RED_SET: handles set and change.
* TC_RED_DESTROY: handle qdisc destroy.
* TC_RED_STATS: update the qdiscs counters (given as reference)
* TC_RED_XSTAT: returns red xstats.

Whether RED is being offloaded is being determined every time dump action
is being called because parent change of this qdisc could change its
offload state but doesn't require any RED function to be called.

Signed-off-by: Nogah Frankel <nogahf@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-08 12:23:37 +09:00
Tom Herbert
fddb231ebe ila: Add a hook type for LWT routes
In LWT tunnels both an input and output route method is defined.
If both of these are executed in the same path then double translation
happens and the effect is not correct.

This patch adds a new attribute that indicates the hook type. Two
values are defined for route output and route output. ILA
translation is only done for the one that is set. The default is
to enable ILA on route output.

Signed-off-by: Tom Herbert <tom@quantonium.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-08 11:20:49 +09:00
Tom Herbert
70d5aef48a ila: allow configuration of identifier type
Allow identifier to be explicitly configured for a mapping.
This can either be one of the identifier types specified in the
ILA draft or a value of ILA_ATYPE_USE_FORMAT which means the
identifier type is inferred from the identifier type field.
If a value other than ILA_ATYPE_USE_FORMAT is set for a
mapping then it is assumed that the identifier type field is
not present in an identifier.

Signed-off-by: Tom Herbert <tom@quantonium.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-08 11:20:48 +09:00
Tom Herbert
84287bb328 ila: add checksum neutral map auto
Add checksum neutral auto that performs checksum neutral mapping
without using the C-bit. This is enabled by configuration of
a mapping.

The checksum neutral function has been split into
ila_csum_do_neutral_fmt and ila_csum_do_neutral_nofmt. The former
handles the C-bit and includes it in the adjustment value. The latter
just sets the adjustment value on the locator diff only.

Added configuration for checksum neutral map aut in ila_lwt
and ila_xlat.

Signed-off-by: Tom Herbert <tom@quantonium.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-08 11:20:48 +09:00
Maciej S. Szmigiero
3928ee6485 net: phy: leds: Add support for "link" trigger
Currently, we create a LED trigger for any link speed known to a PHY.
These triggers only fire when their exact link speed had been negotiated
(they aren't cumulative, that is, they don't fire for "their or any higher"
link speed).

What we are missing, however, is a trigger which will fire on any link
speed known to the PHY. Such trigger can then be used for implementing a
poor man's substitute of the "link" LED on boards that lack it.
Let's add it.

Signed-off-by: Maciej S. Szmigiero <mail@maciej.szmigiero.name>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-08 10:24:14 +09:00
Pablo Neira Ayuso
ba0e4d9917 netfilter: nf_tables: get set elements via netlink
This patch adds a new get operation to look up for specific elements in
a set via netlink interface. You can also use it to check if an interval
already exists.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-11-07 01:00:31 +01:00
Takashi Iwai
3510c7aa06 ALSA: seq: Avoid invalid lockdep class warning
The recent fix for adding rwsem nesting annotation was using the given
"hop" argument as the lock subclass key.  Although the idea itself
works, it may trigger a kernel warning like:
  BUG: looking up invalid subclass: 8
  ....
since the lockdep has a smaller number of subclasses (8) than we
currently allow for the hops there (10).

The current definition is merely a sanity check for avoiding the too
deep delivery paths, and the 8 hops are already enough.  So, as a
quick fix, just follow the max hops as same as the max lockdep
subclasses.

Fixes: 1f20f9ff57 ("ALSA: seq: Fix nested rwsem annotation for lockdep splat")
Reported-by: syzbot <syzkaller@googlegroups.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Takashi Iwai <tiwai@suse.de>
2017-11-06 20:25:31 +01:00
Florian Westphal
5caaed151a netfilter: conntrack: don't cache nlattr_tuple_size result in nla_size
We currently call ->nlattr_tuple_size() once at register time and
cache result in l4proto->nla_size.

nla_size is the only member that is written to, avoiding this would
allow to make l4proto trackers const.

We can use ->nlattr_tuple_size() at run time, and cache result in
the individual trackers instead.

This is an intermediate step, next patch removes nlattr_size()
callback and computes size at compile time, then removes nla_size.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-11-06 16:48:38 +01:00
Takashi Iwai
9b7d869ee5 ALSA: timer: Limit max instances per timer
Currently we allow unlimited number of timer instances, and it may
bring the system hogging way too much CPU when too many timer
instances are opened and processed concurrently.  This may end up with
a soft-lockup report as triggered by syzkaller, especially when
hrtimer backend is deployed.

Since such insane number of instances aren't demanded by the normal
use case of ALSA sequencer and it merely  opens a risk only for abuse,
this patch introduces the upper limit for the number of instances per
timer backend.  As default, it's set to 1000, but for the fine-grained
timer like hrtimer, it's set to 100.

Reported-by: syzbot
Tested-by: Jérôme Glisse <jglisse@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Takashi Iwai <tiwai@suse.de>
2017-11-06 10:41:24 +01:00
Linus Torvalds
9d9cc4aa00 Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Ingo Molnar:
 "Various fixes:

   - synchronize kernel and tooling headers

   - cgroup support fix

   - two tooling fixes"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  tools/headers: Synchronize kernel ABI headers
  perf/cgroup: Fix perf cgroup hierarchy support
  perf tools: Unwind properly location after REJECT
  perf symbols: Fix memory corruption because of zero length symbols
2017-11-05 11:44:39 -08:00
Linus Torvalds
b772b8e3ab Merge branch 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull core fixes from Ingo Molnar:

 - workaround for gcc asm handling

 - futex race fixes

 - objtool build warning fix

 - two watchdog fixes: a crash fix (revert) and a bug fix for
   /proc/sys/kernel/watchdog_thresh handling.

* 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  objtool: Prevent GCC from merging annotate_unreachable(), take 2
  objtool: Resync objtool's instruction decoder source code copy with the kernel's latest version
  watchdog/hardlockup/perf: Use atomics to track in-use cpu counter
  watchdog/harclockup/perf: Revert a33d44843d ("watchdog/hardlockup/perf: Simplify deferred event destroy")
  futex: Fix more put_pi_state() vs. exit_pi_state_list() races
2017-11-05 11:02:15 -08:00
Roman Gushchin
ebc614f687 bpf, cgroup: implement eBPF-based device controller for cgroup v2
Cgroup v2 lacks the device controller, provided by cgroup v1.
This patch adds a new eBPF program type, which in combination
of previously added ability to attach multiple eBPF programs
to a cgroup, will provide a similar functionality, but with some
additional flexibility.

This patch introduces a BPF_PROG_TYPE_CGROUP_DEVICE program type.
A program takes major and minor device numbers, device type
(block/character) and access type (mknod/read/write) as parameters
and returns an integer which defines if the operation should be
allowed or terminated with -EPERM.

Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-05 23:26:51 +09:00
Roman Gushchin
ecf8fecb78 device_cgroup: prepare code for bpf-based device controller
This is non-functional change to prepare the device cgroup code
for adding eBPF-based controller for cgroups v2.

The patch performs the following changes:
1) __devcgroup_inode_permission() and devcgroup_inode_mknod()
   are moving to the device-cgroup.h and converting into static inline.
2) __devcgroup_check_permission() is exported.
3) devcgroup_check_permission() wrapper is introduced to be used
   by both existing and new bpf-based implementations.

Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-05 23:26:51 +09:00
David S. Miller
488e5b30d3 Merge tag 'mlx5-updates-2017-11-04' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux
Saeed Mahameed says:

====================
mlx5-updates-2017-11-04

This series includes:

From Huy: dscp to priority mapping for Ethernet packet.

===================================================
First six patches enable differentiated services code point (dscp) to
priority mapping for Ethernet packet. Once this feature is
enabled, the packet is routed to the corresponding priority based on its
dscp. User can combine this feature with priority flow control (pfc)
feature to have priority flow control based on the dscp.

Firmware interface:
Mellanox firmware provides two control knobs for this feature:
  QPTS register allow changing the trust state between dscp and
  pcp mode. The default is pcp mode. Once in dscp mode, firmware will
  route the packet based on its dscp value if the dscp field exists.

  QPDPM register allow mapping a specific dscp (0 to 63) to a
  specific priority (0 to 7). By default, all the dscps are mapped to
  priority zero.

Software interface:
This feature is controlled via application priority TLV. IEEE
specification P802.1Qcd/D2.1 defines priority selector id 5 for
application priority TLV. This APP TLV selector defines DSCP to priority
map. This APP TLV can be sent by the switch or can be set locally using
software such as lldptool. In mlx5 drivers, we add the support for net
dcb's getapp and setapp call back. Mlx5 driver only handles the selector
id 5 application entry (dscp application priority application entry).
If user sends multiple dscp to priority APP TLV entries on the same
dscp, the last sent one will take effect. All the previous sent will be
deleted.

The firmware trust state (in QPTS register) is changed based on the
number of dscp to priority application entries. When the first dscp to
priority application entry is added by the user, the trust state is
changed to dscp. When the last dscp to priority application entry is
deleted by the user, the trust state is changed to pcp.

When the port is in DSCP trust state, the transmit queue is selected
based on the dscp of the skb.

When the port is in DSCP trust state and vport inline mode is not NONE,
firmware requires mlx5 driver to copy the IP header to the
wqe ethernet segment inline header if the skb has it.
This is done by changing the transmit queue sq's min inline mode to L3.
Note that the min inline mode of sqs that belong to other features
such as xdpsq, icosq are not modified.
===================================================

Plus to the dscp series, some small misc changes are include as well:

From Inbar, Ethtool msglvl support and some debug prints in DCBNL logic
From Or Gerlitz, Enlarge the NIC TC offload table size
From Rabie, Initialize destination_flow struct to 0
From Feras, Add inner TTC table to IPoIB flow steering
From Tal, Enable CQE based moderation on TX CQ
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-05 23:25:02 +09:00
Priyaranjan Jha
1f2556916d tcp: higher throughput under reordering with adaptive RACK reordering wnd
Currently TCP RACK loss detection does not work well if packets are
being reordered beyond its static reordering window (min_rtt/4).Under
such reordering it may falsely trigger loss recoveries and reduce TCP
throughput significantly.

This patch improves that by increasing and reducing the reordering
window based on DSACK, which is now supported in major TCP implementations.
It makes RACK's reo_wnd adaptive based on DSACK and no. of recoveries.

- If DSACK is received, increment reo_wnd by min_rtt/4 (upper bounded
  by srtt), since there is possibility that spurious retransmission was
  due to reordering delay longer than reo_wnd.

- Persist the current reo_wnd value for TCP_RACK_RECOVERY_THRESH (16)
  no. of successful recoveries (accounts for full DSACK-based loss
  recovery undo). After that, reset it to default (min_rtt/4).

- At max, reo_wnd is incremented only once per rtt. So that the new
  DSACK on which we are reacting, is due to the spurious retx (approx)
  after the reo_wnd has been updated last time.

- reo_wnd is tracked in terms of steps (of min_rtt/4), rather than
  absolute value to account for change in rtt.

In our internal testing, we observed significant increase in throughput,
in scenarios where reordering exceeds min_rtt/4 (previous static value).

Signed-off-by: Priyaranjan Jha <priyarjha@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-05 23:15:42 +09:00
Vivien Didelot
49463b7f2d net: dsa: make tree index unsigned
Similarly to a DSA switch and port, rename the tree index from "tree" to
"index" and make it an unsigned int because it isn't supposed to be less
than 0.

u32 is an OF specific data used to retrieve the value and has no need to
be propagated up to the tree index.

Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-05 22:31:38 +09:00
Vivien Didelot
99feaafcdb net: dsa: make switch index unsigned
Define the DSA switch index as an unsigned int, because it will never be
less than 0.

Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-05 22:31:38 +09:00
Jakub Kicinski
b37a530613 bpf: remove old offload/analyzer
Thanks to the ability to load a program for a specific device,
running verifier twice is no longer needed.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-05 22:26:20 +09:00
Jakub Kicinski
248f346ffe xdp: allow attaching programs loaded for specific device
Pass the netdev pointer to bpf_prog_get_type().  This way
BPF code can decide whether the device matches what the
code was loaded/translated for.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-05 22:26:19 +09:00
Jakub Kicinski
bd601b6ada bpf: report offload info to user space
Extend struct bpf_prog_info to contain information about program
being bound to a device.  Since the netdev may get destroyed while
program still exists we need a flag to indicate the program is
loaded for a device, even if the device is gone.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-05 22:26:18 +09:00
Jakub Kicinski
ab3f0063c4 bpf: offload: add infrastructure for loading programs for a specific netdev
The fact that we don't know which device the program is going
to be used on is quite limiting in current eBPF infrastructure.
We have to reverse or limit the changes which kernel makes to
the loaded bytecode if we want it to be offloaded to a networking
device.  We also have to invent new APIs for debugging and
troubleshooting support.

Make it possible to load programs for a specific netdev.  This
helps us to bring the debug information closer to the core
eBPF infrastructure (e.g. we will be able to reuse the verifer
log in device JIT).  It allows device JITs to perform translation
on the original bytecode.

__bpf_prog_get() when called to get a reference for an attachment
point will now refuse to give it if program has a device assigned.
Following patches will add a version of that function which passes
the expected netdev in. @type argument in __bpf_prog_get() is
renamed to attach_type to make it clearer that it's only set on
attachment.

All calls to ndo_bpf are protected by rtnl, only verifier callbacks
are not.  We need a wait queue to make sure netdev doesn't get
destroyed while verifier is still running and calling its driver.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-05 22:26:18 +09:00
Jakub Kicinski
f4e63525ee net: bpf: rename ndo_xdp to ndo_bpf
ndo_xdp is a control path callback for setting up XDP in the
driver.  We can reuse it for other forms of communication
between the eBPF stack and the drivers.  Rename the callback
and associated structures and definitions.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-05 22:26:18 +09:00
Jiri Benc
79e1ad148c rtnetlink: use netnsid to query interface
Currently, when an application gets netnsid from the kernel (for example as
the result of RTM_GETLINK call on one end of the veth pair), it's not much
useful. There's no reliable way to get to the netns fd from the netnsid, nor
does any kernel API accept netnsid.

Extend the RTM_GETLINK call to also accept netnsid. It will operate on the
netns with the given netnsid in such case. Of course, the calling process
needs to have enough capabilities in the target name space; for now, require
CAP_NET_ADMIN. This can be relaxed in the future.

To signal to the calling process that the kernel understood the new
IFLA_IF_NETNSID attribute in the query, it will include it in the response.
This is needed to detect older kernels, as they will just ignore
IFLA_IF_NETNSID and query in the current name space.

This patch implemetns IFLA_IF_NETNSID only for get and dump. For set
operations, this can be extended later.

Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-05 21:49:17 +09:00