commit c364b6d0b6 upstream.
In a bmapx call, bmv_count is the total size of the array, including the
zeroth element that userspace uses to supply the search key. The output
array starts at offset 1 so that we can set up the user for the next
invocation. Since we now can split an extent into multiple bmap records
due to shared/unshared status, we have to be careful that we don't
overflow the output array.
In the original patch f86f403794 ("xfs: teach get_bmapx about shared
extents and the CoW fork") I used cur_ext (the output index) to check
for overflows, albeit with an off-by-one error. Since nexleft no longer
describes the number of unfilled slots in the output, we can rip all
that out and use cur_ext for the overflow check directly.
Failure to do this causes heap corruption in bmapx callers such as
xfs_io and xfs_scrub. xfs/328 can reproduce this problem.
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 2aa6ba7b5a upstream.
If we try to allocate memory pages to back an xfs_buf that we're trying
to read, it's possible that we'll be so short on memory that the page
allocation fails. For a blocking read we'll just wait, but for
readahead we simply dump all the pages we've collected so far.
Unfortunately, after dumping the pages we neglect to clear the
_XBF_PAGES state, which means that the subsequent call to xfs_buf_free
thinks that b_pages still points to pages we own. It then double-frees
the b_pages pages.
This results in screaming about negative page refcounts from the memory
manager, which xfs oughtn't be triggering. To reproduce this case,
mount a filesystem where the size of the inodes far outweighs the
availalble memory (a ~500M inode filesystem on a VM with 300MB memory
did the trick here) and run bulkstat in parallel with other memory
eating processes to put a huge load on the system. The "check summary"
phase of xfs_scrub also works for this purpose.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 493611ebd6 upstream.
With COW files they are the hotpath, just like for files with the
extent size hint attribute. We really shouldn't micro-manage anything
but failure cases with unlikely.
Additionally Arnd Bergmann recently reported that one of these two
unlikely annotations causes link failures together with an upcoming
kernel instrumentation patch, so let's get rid of it ASAP.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reported-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 5a93790d4e upstream.
xfs_attr_[get|remove]() have unlocked attribute fork checks to optimize
away a lock cycle in cases where the fork does not exist or is otherwise
empty. This check is not safe, however, because an attribute fork short
form to extent format conversion includes a transient state that causes
the xfs_inode_hasattr() check to fail. Specifically,
xfs_attr_shortform_to_leaf() creates an empty extent format attribute
fork and then adds the existing shortform attributes to it.
This means that lookup of an existing xattr can spuriously return
-ENOATTR when racing against a setxattr that causes the associated
format conversion. This was originally reproduced by an untar on a
particularly configured glusterfs volume, but can also be reproduced on
demand with properly crafted xattr requests.
The format conversion occurs under the exclusive ilock. xfs_attr_get()
and xfs_attr_remove() already have the proper locking and checks further
down in the functions to handle this situation correctly. Drop the
unlocked checks to avoid the spurious failure and rely on the existing
logic.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 83d230eb5c upstream.
sb_dirblklog is added to sb_blocklog to compute the directory block size
in bytes. Therefore, we must compare the sum of both those values
against XFS_MAX_BLOCKSIZE_LOG, not just dirblklog.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit d2b3964a07 upstream.
Due to the way how xfs_iomap_write_allocate tries to convert the whole
found extents from delalloc to real space we can run into a race
condition with multiple threads doing writes to this same extent.
For the non-COW case that is harmless as the only thing that can happen
is that we call xfs_bmapi_write on an extent that has already been
converted to a real allocation. For COW writes where we move the extent
from the COW to the data fork after I/O completion the race is, however,
not quite as harmless. In the worst case we are now calling
xfs_bmapi_write on a region that contains hole in the COW work, which
will trip up an assert in debug builds or lead to file system corruption
in non-debug builds. This seems to be reproducible with workloads of
small O_DSYNC write, although so far I've not managed to come up with
a with an isolated reproducer.
The fix for the issue is relatively simple: tell xfs_bmapi_write
that we are only asked to convert delayed allocations and skip holes
in that case.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit fd29f7af75 upstream.
A harmless warning just got introduced:
fs/xfs/libxfs/xfs_dir2.h:40:8: error: type qualifiers ignored on function return type [-Werror=ignored-qualifiers]
Removing the 'const' modifier avoids the warning and has no
other effect.
Fixes: 1fc4d33fed ("xfs: replace xfs_mode_to_ftype table with switch statement")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 657bdfb7f5 upstream.
The GETNEXTQOTA ioctl takes whatever ID is sent in,
and looks for the next active quota for an user
equal or higher to that ID.
But if we are at the maximum ID and then ask for the "next"
one, we may wrap back to zero. In this case, userspace
may loop forever, because it will start querying again
at zero.
We'll fix this in userspace as well, but for the kernel,
return -ENOENT if we ask for the next quota ID
past UINT_MAX so the caller knows to stop.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit fab8eef86c upstream.
The helper xfs_dentry_to_name() is used by 2 different
classes of callers: Callers that pass zero mode and don't care
about the returned name.type field and Callers that pass
non zero mode and do care about the name.type field.
Change xfs_dentry_to_name() to not take the mode argument and
change the call sites of the first class to not pass the mode
argument.
Create a new helper xfs_dentry_mode_to_name() which does pass
the mode argument and returns -EFSCORRUPTED if mode is invalid.
Callers that translate non zero mode to on-disk file type now
check the return value and will export the error to user instead
of staging an invalid file type to be written to directory entry.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 1fc4d33fed.
The size of the xfs_mode_to_ftype[] conversion table
was too small to handle an invalid value of mode=S_IFMT.
Instead of fixing the table size, replace the conversion table
with a conversion helper that uses a switch statement.
Suggested-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit b597dd5373 upstream.
xfs_dir2.h dereferences some data types in inline functions
and fails to include those type definitions, e.g.:
xfs_dir2_data_aoff_t, struct xfs_da_geometry.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 3c6f46eacd upstream.
This changes fixes an assertion hit when fuzzing on-disk
i_mode values.
The easy case to fix is when changing an empty file
i_mode to S_IFDIR. In this case, xfs_dinode_verify()
detects an illegal zero size for directory and fails
to load the inode structure from disk.
For the case of non empty file whose i_mode is changed
to S_IFDIR, the ASSERT() statement in xfs_dir2_isblock()
is replaced with return -EFSCORRUPTED, to avoid interacting
with corrupted jusk also when XFS_DEBUG is disabled.
Suggested-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 84a4620cfe upstream.
There are only two reasons for xfs_log_force / xfs_log_force_lsn to fail:
one is an I/O error, for which xlog_bdstrat already logs a warning, and
the second is an already shutdown log due to a previous I/O errors. In
the latter case we'll already have a previous indication for the actual
error, but the large stream of misleading warnings from xfs_log_force
will probably scroll it out of the message buffer.
Simply removing the warnings thus makes the XFS log reporting significantly
better.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 12ef830198 upstream.
->total is a bit of an odd parameter passed down to the low-level
allocator all the way from the high-level callers. It's supposed to
contain the maximum number of blocks to be allocated for the whole
transaction [1].
But in xfs_iomap_write_allocate we only convert existing delayed
allocations and thus only have a minimal block reservation for the
current transaction, so xfs_alloc_space_available can't use it for
the allocation decisions. Use the maximum of args->total and the
calculated block requirement to make a decision. We probably should
get rid of args->total eventually and instead apply ->minleft more
broadly, but that will require some extensive changes all over.
[1] which creates lots of confusion as most callers don't decrement it
once doing a first allocation. But that's for a separate series.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 54fee133ad upstream.
We must decide in xfs_alloc_fix_freelist if we can perform an
allocation from a given AG is possible or not based on the available
space, and should not fail the allocation past that point on a
healthy file system.
But currently we have two additional places that second-guess
xfs_alloc_fix_freelist: xfs_alloc_ag_vextent tries to adjust the
maxlen parameter to remove the reservation before doing the
allocation (but ignores the various minium freespace requirements),
and xfs_alloc_fix_minleft tries to fix up the allocated length
after we've found an extent, but ignores the reservations and also
doesn't take the AGFL into account (and thus fails allocations
for not matching minlen in some cases).
Remove all these later fixups and just correct the maxlen argument
inside xfs_alloc_fix_freelist once we have the AGF buffer locked.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 255c516278 upstream.
We can't just set minleft to 0 when we're low on space - that's exactly
what we need minleft for: to protect space in the AG for btree block
allocations when we are low on free space.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 5149fd327f upstream.
Setting aside 4 blocks globally for bmbt splits isn't all that useful,
as different threads can allocate space in parallel. Bump it to 4
blocks per AG to allow each thread that is currently doing an
allocation to dip into it separately. Without that we may no have
enough reserved blocks if there are enough parallel transactions
in an almost out space file system that all run into bmap btree
splits.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit f154be241d ]
Commit 448b4482c6 ("net: dsa: Add lockdep class to tx queues to avoid
lockdep splat") removed the netif_device_detach() call done in
dsa_slave_suspend() which is necessary, and paired with a corresponding
netif_device_attach(), bring it back.
Fixes: 448b4482c6 ("net: dsa: Add lockdep class to tx queues to avoid lockdep splat")
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 85c814016c ]
When attempting to free lwtunnel state after the module for the encap
has been unloaded an oops occurs:
BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
IP: lwtstate_free+0x18/0x40
[..]
task: ffff88003e372380 task.stack: ffffc900001fc000
RIP: 0010:lwtstate_free+0x18/0x40
RSP: 0018:ffff88003fd83e88 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff88002bbb3380 RCX: ffff88000c91a300
[..]
Call Trace:
<IRQ>
free_fib_info_rcu+0x195/0x1a0
? rt_fibinfo_free+0x50/0x50
rcu_process_callbacks+0x2d3/0x850
? rcu_process_callbacks+0x296/0x850
__do_softirq+0xe4/0x4cb
irq_exit+0xb0/0xc0
smp_apic_timer_interrupt+0x3d/0x50
apic_timer_interrupt+0x93/0xa0
[..]
Code: e8 6e c6 fc ff 89 d8 5b 5d c3 bb de ff ff ff eb f4 66 90 66 66 66 66 90 55 48 89 e5 53 0f b7 07 48 89 fb 48 8b 04 c5 00 81 d5 81 <48> 8b 40 08 48 85 c0 74 13 ff d0 48 8d 7b 20 be 20 00 00 00 e8
The problem is after the module for the encap can be unloaded the
corresponding ops is removed and is thus NULL here.
Modules implementing lwtunnel ops should not be allowed to unload
while there is state alive using those ops, so grab the module
reference for the ops on creating lwtunnel state and of course release
the reference when freeing the state.
Fixes: 1104d9ba44 ("lwtunnel: Add destroy state operation")
Signed-off-by: Robert Shearman <rshearma@brocade.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 88ff7334f2 ]
Modules implementing lwtunnel ops should not be allowed to unload
while there is state alive using those ops, so specify the owning
module for all lwtunnel ops.
Signed-off-by: Robert Shearman <rshearma@brocade.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 5b9f575163 ]
Another rebranded Novatel E371. qmi_wwan should drive this device, while
cdc_ether should ignore it. Even though the USB descriptors are plain
CDC-ETHER that USB interface is a QMI interface. Ref commit 7fdb7846c9
("qmi_wwan/cdc_ether: add device IDs for Dell 5804 (Novatel E371) WWAN
card")
Cc: Dan Williams <dcbw@redhat.com>
Signed-off-by: Bjørn Mork <bjorn@mork.no>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 0fb44559ff ]
Dmitry reported a deadlock scenario:
unix_bind() path:
u->bindlock ==> sb_writer
do_splice() path:
sb_writer ==> pipe->mutex ==> u->bindlock
In the unix_bind() code path, unix_mknod() does not have to
be done with u->bindlock held, since it is a pure fs operation,
so we can just move unix_mknod() out.
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Tested-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Rainer Weikusat <rweikusat@mobileactivedefense.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 9f427a0e47 ]
MPLS multipath for LSR is broken -- always selecting the first nexthop
in the one label case. For example:
$ ip -f mpls ro ls
100
nexthop as to 200 via inet 172.16.2.2 dev virt12
nexthop as to 300 via inet 172.16.3.2 dev virt13
101
nexthop as to 201 via inet6 2000:2::2 dev virt12
nexthop as to 301 via inet6 2000:3::2 dev virt13
In this example incoming packets have a single MPLS labels which means
BOS bit is set. The BOS bit is passed from mpls_forward down to
mpls_multipath_hash which never processes the hash loop because BOS is 1.
Update mpls_multipath_hash to process the entire label stack. mpls_hdr_len
tracks the total mpls header length on each pass (on pass N mpls_hdr_len
is N * sizeof(mpls_shim_hdr)). When the label is found with the BOS set
it verifies the skb has sufficient header for ipv4 or ipv6, and find the
IPv4 and IPv6 header by using the last mpls_hdr pointer and adding 1 to
advance past it.
With these changes I have verified the code correctly sees the label,
BOS, IPv4 and IPv6 addresses in the network header and icmp/tcp/udp
traffic for ipv4 and ipv6 are distributed across the nexthops.
Fixes: 1c78efa831 ("mpls: flow-based multipath selection")
Acked-by: Robert Shearman <rshearma@brocade.com>
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit b6677449df ]
Any bridge options specified during link creation (e.g. ip link add)
are ignored as br_dev_newlink() does not process them.
Use br_changelink() to do it.
Fixes: 1332351617 ("bridge: implement rtnl_link_ops->changelink")
Signed-off-by: Ivan Vecera <cera@cera.cz>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit e048fc50d7 ]
A driver using dev_alloc_page() must not reuse a page allocated from
emergency memory reserve.
Otherwise all packets using this page will be immediately dropped,
unless for very specific sockets having SOCK_MEMALLOC bit set.
This issue might be hard to debug, because only a fraction of received
packets would be dropped.
Fixes: 4415a0319f ("net/mlx5e: Implement RX mapped page cache for page recycle")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Tariq Toukan <tariqt@mellanox.com>
Cc: Saeed Mahameed <saeedm@mellanox.com>
Acked-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 0dbd7ff3ac ]
Found that if we run LTP netstress test with large MSS (65K),
the first attempt from server to send data comparable to this
MSS on fastopen connection will be delayed by the probe timer.
Here is an example:
< S seq 0:0 win 43690 options [mss 65495 wscale 7 tfo cookie] length 32
> S. seq 0:0 ack 1 win 43690 options [mss 65495 wscale 7] length 0
< . ack 1 win 342 length 0
Inside tcp_sendmsg(), tcp_send_mss() returns max MSS in 'mss_now',
as well as in 'size_goal'. This results the segment not queued for
transmition until all the data copied from user buffer. Then, inside
__tcp_push_pending_frames(), it breaks on send window test and
continues with the check probe timer.
Fragmentation occurs in tcp_write_wakeup()...
+0.2 > P. seq 1:43777 ack 1 win 342 length 43776
< . ack 43777, win 1365 length 0
> P. seq 43777:65001 ack 1 win 342 options [...] length 21224
...
This also contradicts with the fact that we should bound to the half
of the window if it is large.
Fix this flaw by correctly initializing max_window. Before that, it
could have large values that affect further calculations of 'size_goal'.
Fixes: 168a8f5805 ("tcp: TCP Fast Open Server - main code path")
Signed-off-by: Alexey Kodanev <alexey.kodanev@oracle.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 03e4deff49 ]
Just like commit 4acd4945cd ("ipv6: addrconf: Avoid calling
netdevice notifiers with RCU read-side lock"), it is unnecessary
to make addrconf_disable_change() use RCU iteration over the
netdev list, since it already holds the RTNL lock, or we may meet
Illegal context switch in RCU read-side critical section.
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 9ed59592e3]
Trying to add an mpls encap route when the MPLS modules are not loaded
hangs. For example:
CONFIG_MPLS=y
CONFIG_NET_MPLS_GSO=m
CONFIG_MPLS_ROUTING=m
CONFIG_MPLS_IPTUNNEL=m
$ ip route add 10.10.10.10/32 encap mpls 100 via inet 10.100.1.2
The ip command hangs:
root 880 826 0 21:25 pts/0 00:00:00 ip route add 10.10.10.10/32 encap mpls 100 via inet 10.100.1.2
$ cat /proc/880/stack
[<ffffffff81065a9b>] call_usermodehelper_exec+0xd6/0x134
[<ffffffff81065efc>] __request_module+0x27b/0x30a
[<ffffffff814542f6>] lwtunnel_build_state+0xe4/0x178
[<ffffffff814aa1e4>] fib_create_info+0x47f/0xdd4
[<ffffffff814ae451>] fib_table_insert+0x90/0x41f
[<ffffffff814a8010>] inet_rtm_newroute+0x4b/0x52
...
modprobe is trying to load rtnl-lwt-MPLS:
root 881 5 0 21:25 ? 00:00:00 /sbin/modprobe -q -- rtnl-lwt-MPLS
and it hangs after loading mpls_router:
$ cat /proc/881/stack
[<ffffffff81441537>] rtnl_lock+0x12/0x14
[<ffffffff8142ca2a>] register_netdevice_notifier+0x16/0x179
[<ffffffffa0033025>] mpls_init+0x25/0x1000 [mpls_router]
[<ffffffff81000471>] do_one_initcall+0x8e/0x13f
[<ffffffff81119961>] do_init_module+0x5a/0x1e5
[<ffffffff810bd070>] load_module+0x13bd/0x17d6
...
The problem is that lwtunnel_build_state is called with rtnl lock
held preventing mpls_init from registering.
Given the potential references held by the time lwtunnel_build_state it
can not drop the rtnl lock to the load module. So, extract the module
loading code from lwtunnel_build_state into a new function to validate
the encap type. The new function is called while converting the user
request into a fib_config which is well before any table, device or
fib entries are examined.
Fixes: 745041e2aa ("lwtunnel: autoload of lwt modules")
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit cd33b3e0da ]
Commit a1cba5613e ("net: phy: Add Broadcom phy library for common
interfaces") make the BCM63xx PHY driver utilize bcm_phy_config_intr()
which would appear to do the right thing, except that it does not write
to the MII_BCM63XX_IR register but to MII_BCM54XX_ECR which is
different.
This would be causing invalid link parameters and events from being
generated by the PHY interrupt.
Fixes: a1cba5613e ("net: phy: Add Broadcom phy library for common interfaces")
Signed-off-by: Daniel Gonzalez Cabanelas <dgcbueu@gmail.com>
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 7be2c82cfd ]
Ashizuka reported a highmem oddity and sent a patch for freescale
fec driver.
But the problem root cause is that core networking stack
must ensure no skb with highmem fragment is ever sent through
a device that does not assert NETIF_F_HIGHDMA in its features.
We need to call illegal_highdma() from harmonize_features()
regardless of CSUM checks.
Fixes: ec5f061564 ("net: Kill link between CSUM and SG features.")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Pravin Shelar <pshelar@ovn.org>
Reported-by: "Ashizuka, Yuusuke" <ashiduka@jp.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit d5ff72d9af ]
vxlan->cfg.dst_port is in network byte order, so an htons()
is needed here. Also reduced comment length to stay closer
to 80 column width (still slightly over, however).
Fixes: e1e5314de0 ("vxlan: implement GPE")
Signed-off-by: Lance Richardson <lrichard@redhat.com>
Acked-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 501db51139 ]
This patch part reverts fd2a0437dc and e858fae2b0 which introduced a
subtle change in how the virtio_net flags are derived from the SKBs
ip_summed field.
With the above commits, the flags are set to VIRTIO_NET_HDR_F_DATA_VALID
when ip_summed == CHECKSUM_UNNECESSARY, thus treating it differently to
ip_summed == CHECKSUM_NONE, which should be the same.
Further, the virtio spec 1.0 / CS04 explicitly says that
VIRTIO_NET_HDR_F_DATA_VALID must not be set by the driver.
Fixes: fd2a0437dc ("virtio_net: introduce virtio_net_hdr_{from,to}_skb")
Fixes: e858fae2b0 (" virtio_net: use common code for virtio_net_hdr and skb GSO conversion")
Signed-off-by: Rolf Neugebauer <rolf.neugebauer@docker.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 0faa9cb5b3 ]
Demonstrating the issue:
.. add a drop action
$sudo $TC actions add action drop index 10
.. retrieve it
$ sudo $TC -s actions get action gact index 10
action order 1: gact action drop
random type none pass val 0
index 10 ref 2 bind 0 installed 29 sec used 29 sec
Action statistics:
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
... bug 1 above: reference is two.
Reference is actually 1 but we forget to subtract 1.
... do a GET again and we see the same issue
try a few times and nothing changes
~$ sudo $TC -s actions get action gact index 10
action order 1: gact action drop
random type none pass val 0
index 10 ref 2 bind 0 installed 31 sec used 31 sec
Action statistics:
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
... lets try to bind the action to a filter..
$ sudo $TC qdisc add dev lo ingress
$ sudo $TC filter add dev lo parent ffff: protocol ip prio 1 \
u32 match ip dst 127.0.0.1/32 flowid 1:1 action gact index 10
... and now a few GETs:
$ sudo $TC -s actions get action gact index 10
action order 1: gact action drop
random type none pass val 0
index 10 ref 3 bind 1 installed 204 sec used 204 sec
Action statistics:
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
$ sudo $TC -s actions get action gact index 10
action order 1: gact action drop
random type none pass val 0
index 10 ref 4 bind 1 installed 206 sec used 206 sec
Action statistics:
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
$ sudo $TC -s actions get action gact index 10
action order 1: gact action drop
random type none pass val 0
index 10 ref 5 bind 1 installed 235 sec used 235 sec
Action statistics:
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
.... as can be observed the reference count keeps going up.
After the fix
$ sudo $TC actions add action drop index 10
$ sudo $TC -s actions get action gact index 10
action order 1: gact action drop
random type none pass val 0
index 10 ref 1 bind 0 installed 4 sec used 4 sec
Action statistics:
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
$ sudo $TC -s actions get action gact index 10
action order 1: gact action drop
random type none pass val 0
index 10 ref 1 bind 0 installed 6 sec used 6 sec
Action statistics:
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
$ sudo $TC qdisc add dev lo ingress
$ sudo $TC filter add dev lo parent ffff: protocol ip prio 1 \
u32 match ip dst 127.0.0.1/32 flowid 1:1 action gact index 10
$ sudo $TC -s actions get action gact index 10
action order 1: gact action drop
random type none pass val 0
index 10 ref 2 bind 1 installed 32 sec used 32 sec
Action statistics:
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
$ sudo $TC -s actions get action gact index 10
action order 1: gact action drop
random type none pass val 0
index 10 ref 2 bind 1 installed 33 sec used 33 sec
Action statistics:
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
Fixes: aecc5cefc3 ("net sched actions: fix GETing actions")
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 8a367e74c0 ]
The ax.25 socket connection timed out & the sock struct has been
previously taken down ie. sock struct is now a NULL pointer. Checking
the sock_flag causes the segfault. Check if the socket struct pointer
is NULL before checking sock_flag. This segfault is seen in
timed out netrom connections.
Please submit to -stable.
Signed-off-by: Basil Gunn <basil@pacabunga.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 02ca0423fd ]
With ip6gre we have a tunnel header which also makes the tunnel MTU
smaller. We need to reserve room for it. Previously we were using up
space reserved for the Tunnel Encapsulation Limit option
header (RFC 2473).
Also, after commit b05229f442 ("gre6: Cleanup GREv6 transmit path,
call common GRE functions") our contract with the caller has
changed. Now we check if the packet length exceeds the tunnel MTU after
the tunnel header has been pushed, unlike before.
This is reflected in the check where we look at the packet length minus
the size of the tunnel header, which is already accounted for in tunnel
MTU.
Fixes: b05229f442 ("gre6: Cleanup GREv6 transmit path, call common GRE functions")
Signed-off-by: Jakub Sitnicki <jkbs@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 003c941057 ]
Fix up a data alignment issue on sparc by swapping the order
of the cookie byte array field with the length field in
struct tcp_fastopen_cookie, and making it a proper union
to clean up the typecasting.
This addresses log complaints like these:
log_unaligned: 113 callbacks suppressed
Kernel unaligned access at TPC[976490] tcp_try_fastopen+0x2d0/0x360
Kernel unaligned access at TPC[9764ac] tcp_try_fastopen+0x2ec/0x360
Kernel unaligned access at TPC[9764c8] tcp_try_fastopen+0x308/0x360
Kernel unaligned access at TPC[9764e4] tcp_try_fastopen+0x324/0x360
Kernel unaligned access at TPC[976490] tcp_try_fastopen+0x2d0/0x360
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 148d3d021c ]
The __bcm_sysport_tx_reclaim() function is used to reclaim transmit
resources in different places within the driver. Most of them should
not affect the state of the transit flow control.
Introduce bcm_sysport_tx_clean() which cleans the ring, but does not
re-enable flow control towards the networking stack, and make
bcm_sysport_tx_reclaim() do the actual transmit queue flow control.
Fixes: 80105befdb ("net: systemport: add Broadcom SYSTEMPORT Ethernet MAC driver")
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 8a430ed50b ]
rtm_table is an 8-bit field while table ids are allowed up to u32. Commit
709772e6e0 ("net: Fix routing tables with id > 255 for legacy software")
added the preference to set rtm_table in dumps to RT_TABLE_COMPAT if the
table id is > 255. The table id returned on get route requests should do
the same.
Fixes: c36ba6603a ("net: Allow user to get table id from route lookup")
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit ea7a80858f ]
Handle failure in lwtunnel_fill_encap adding attributes to skb.
Fixes: 571e722676 ("ipv4: support for fib route lwtunnel encap attributes")
Fixes: 19e42e4515 ("ipv6: support for fib route lwtunnel encap attributes")
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 28e46a0f2e ]
The event_data starts from address 0x00-0x0C and not from 0x08-0x014. This
leads to duplication with other fields in the Event Queue Element such as
sub-type, cqn and owner.
Fixes: eda6500a98 ("mlxsw: Add PCI bus implementation")
Signed-off-by: Elad Raz <eladr@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 400fc0106d ]
During transmission the skb is checked for headroom in order to
add vendor specific header. In case the skb needs to be re-allocated,
skb_realloc_headroom() is called to make a private copy of the original,
but doesn't release it. Current code assumes that the original skb is
released during reallocation and only releases it at the error path
which causes a memory leak.
Fix this by adding the original skb release to the main path.
Fixes: d003462a50 ("mlxsw: Simplify mlxsw_sx_port_xmit function")
Signed-off-by: Arkadi Sharshevsky <arkadis@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 36bf38d158 ]
During transmission the skb is checked for headroom in order to
add vendor specific header. In case the skb needs to be re-allocated,
skb_realloc_headroom() is called to make a private copy of the original,
but doesn't release it. Current code assumes that the original skb is
released during reallocation and only releases it at the error path
which causes a memory leak.
Fix this by adding the original skb release to the main path.
Fixes: 56ade8fe3f ("mlxsw: spectrum: Add initial support for Spectrum ASIC")
Signed-off-by: Arkadi Sharshevsky <arkadis@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>