commit bd7a3fe770 upstream.
Andrey Konovalov reported a possible out-of-bounds problem for a USB interface
association descriptor. He writes:
It seems there's no proper size check of a USB_DT_INTERFACE_ASSOCIATION
descriptor. It's only checked that the size is >= 2 in
usb_parse_configuration(), so find_iad() might do out-of-bounds access
to intf_assoc->bInterfaceCount.
And he's right, we don't check for crazy descriptors of this type very well, so
resolve this problem. Yet another issue found by syzkaller...
Reported-by: Andrey Konovalov <andreyknvl@google.com>
Tested-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 568af6de05 ]
Phil Sutter reports that IPv6 AH header matching is broken. From
userspace, nft generates bytecode that expects to find the AH header at
NFT_PAYLOAD_TRANSPORT_HEADER both for IPv4 and IPv6. However,
pktinfo->thoff is set to the inner header after the AH header in IPv6,
while in IPv4 pktinfo->thoff points to the AH header indeed. This
behaviour is inconsistent. This patch fixes this problem by updating
ipv6_find_hdr() to get the IP6_FH_F_AUTH flag so this function stops at
the AH header, so both IPv4 and IPv6 pktinfo->thoff point to the AH
header.
This is also inconsistent when trying to match encapsulated headers:
1) A packet that looks like IPv4 + AH + TCP dport 22 will *not* match.
2) A packet that looks like IPv6 + AH + TCP dport 22 will match.
Reported-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 5ef1ecf060 ]
Certain 64-bit systems (e.g. Amlogic Meson GX) require buffers to be
used for DMA to be 8-byte-aligned. struct sdio_func has an embedded
small DMA buffer not meeting this requirement.
When testing switching to descriptor chain mode in meson-gx driver
SDIO is broken therefore. Fix this by allocating the small DMA buffer
separately as kmalloc ensures that the returned memory area is
properly aligned for every basic data type.
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Tested-by: Helmut Klein <hgkr.klein@gmail.com>
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 237bbd29f7 upstream.
It was possible for an unprivileged user to create the user and user
session keyrings for another user. For example:
sudo -u '#3000' sh -c 'keyctl add keyring _uid.4000 "" @u
keyctl add keyring _uid_ses.4000 "" @u
sleep 15' &
sleep 1
sudo -u '#4000' keyctl describe @u
sudo -u '#4000' keyctl describe @us
This is problematic because these "fake" keyrings won't have the right
permissions. In particular, the user who created them first will own
them and will have full access to them via the possessor permissions,
which can be used to compromise the security of a user's keys:
-4: alswrv-----v------------ 3000 0 keyring: _uid.4000
-5: alswrv-----v------------ 3000 0 keyring: _uid_ses.4000
Fix it by marking user and user session keyrings with a flag
KEY_FLAG_UID_KEYRING. Then, when searching for a user or user session
keyring by name, skip all keyrings that don't have the flag set.
Fixes: 69664cf16a ("keys: don't generate user and user session keyrings unless they're accessed")
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 5316821590 upstream.
With TXQs, the AP_VLAN interfaces are resolved to their owner AP
interface when enqueuing the frame, which makes sense since the
frame really goes out on that as far as the driver is concerned.
However, this introduces a problem: frames to be encrypted with
a VLAN-specific GTK will now be encrypted with the AP GTK, since
the information about which virtual interface to use to select
the key is taken from the TXQ.
Fix this by preserving info->control.vif and using that in the
dequeue function. This now requires doing the driver-mapping
in the dequeue as well.
Since there's no way to filter the frames that are sitting on a
TXQ, drop all frames, which may affect other interfaces, when an
AP_VLAN is removed.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit e652399edb upstream.
Version 5 CCPs have some new requirements for XTS-AES: the type field
must be specified, and the key requires 512 bits, with each part
occupying 256 bits and padded with zeroes.
Signed-off-by: Gary R Hook <ghook@amd.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 979990c628 upstream.
kernelci.org reports a crazy stack usage for the VT code when CONFIG_KASAN
is enabled:
drivers/tty/vt/keyboard.c: In function 'kbd_keycode':
drivers/tty/vt/keyboard.c:1452:1: error: the frame size of 2240 bytes is larger than 2048 bytes [-Werror=frame-larger-than=]
The problem is that tty_insert_flip_char() gets inlined many times into
kbd_keycode(), and also into other functions, and each copy requires 128
bytes for stack redzone to check for a possible out-of-bounds access on
the 'ch' and 'flags' arguments that are passed into
tty_insert_flip_string_flags as a variable-length string.
This introduces a new __tty_insert_flip_char() function for the slow
path, which receives the two arguments by value. This completely avoids
the problem and the stack usage goes back down to around 100 bytes.
Without KASAN, this is also slightly better, as we don't have to
spill the arguments to the stack but can simply pass 'ch' and 'flag'
in registers, saving a few bytes in .text for each call site.
This should be backported to linux-4.0 or later, which first introduced
the stack sanitizer in the kernel.
Fixes: c420f167db ("kasan: enable stack instrumentation")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 799ea9e9c5 upstream.
When we introduced the bmap redo log items, we set MS_ACTIVE on the
mountpoint and XFS_IRECOVERY on the inode to prevent unlinked inodes
from being truncated prematurely during log recovery. This also had the
effect of putting linked inodes on the lru instead of evicting them.
Unfortunately, we neglected to find all those unreferenced lru inodes
and evict them after finishing log recovery, which means that we leak
them if anything goes wrong in the rest of xfs_mountfs, because the lru
is only cleaned out on unmount.
Therefore, evict unreferenced inodes in the lru list immediately
after clearing MS_ACTIVE.
Fixes: 17c12bcd30 ("xfs: when replaying bmap operations, don't let unlinked inodes get reaped")
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Cc: viro@ZenIV.linux.org.uk
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 5a63643e58 ]
This reverts commit 1d6119baf0.
After reverting commit 6d7b857d54 ("net: use lib/percpu_counter API
for fragmentation mem accounting") then here is no need for this
fix-up patch. As percpu_counter is no longer used, it cannot
memory leak it any-longer.
Fixes: 6d7b857d54 ("net: use lib/percpu_counter API for fragmentation mem accounting")
Fixes: 1d6119baf0 ("net: fix percpu memory leaks")
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit fb452a1aa3 ]
This reverts commit 6d7b857d54.
There is a bug in fragmentation codes use of the percpu_counter API,
that can cause issues on systems with many CPUs.
The frag_mem_limit() just reads the global counter (fbc->count),
without considering other CPUs can have upto batch size (130K) that
haven't been subtracted yet. Due to the 3MBytes lower thresh limit,
this become dangerous at >=24 CPUs (3*1024*1024/130000=24).
The correct API usage would be to use __percpu_counter_compare() which
does the right thing, and takes into account the number of (online)
CPUs and batch size, to account for this and call __percpu_counter_sum()
when needed.
We choose to revert the use of the lib/percpu_counter API for frag
memory accounting for several reasons:
1) On systems with CPUs > 24, the heavier fully locked
__percpu_counter_sum() is always invoked, which will be more
expensive than the atomic_t that is reverted to.
Given systems with more than 24 CPUs are becoming common this doesn't
seem like a good option. To mitigate this, the batch size could be
decreased and thresh be increased.
2) The add_frag_mem_limit+sub_frag_mem_limit pairs happen on the RX
CPU, before SKBs are pushed into sockets on remote CPUs. Given
NICs can only hash on L2 part of the IP-header, the NIC-RXq's will
likely be limited. Thus, a fair chance that atomic add+dec happen
on the same CPU.
Revert note that commit 1d6119baf0 ("net: fix percpu memory leaks")
removed init_frag_mem_limit() and instead use inet_frags_init_net().
After this revert, inet_frags_uninit_net() becomes empty.
Fixes: 6d7b857d54 ("net: use lib/percpu_counter API for fragmentation mem accounting")
Fixes: 1d6119baf0 ("net: fix percpu memory leaks")
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 25cc72a338 ]
The mlxsw driver relies on NETDEV_CHANGEUPPER events to configure the
device in case a port is enslaved to a master netdev such as bridge or
bond.
Since the driver ignores events unrelated to its ports and their
uppers, it's possible to engineer situations in which the device's data
path differs from the kernel's.
One example to such a situation is when a port is enslaved to a bond
that is already enslaved to a bridge. When the bond was enslaved the
driver ignored the event - as the bond wasn't one of its uppers - and
therefore a bridge port instance isn't created in the device.
Until such configurations are supported forbid them by checking that the
upper device doesn't have uppers of its own.
Fixes: 0d65fc1304 ("mlxsw: spectrum: Implement LAG port join/leave")
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reported-by: Nogah Frankel <nogahf@mellanox.com>
Tested-by: Nogah Frankel <nogahf@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 4e587ea71b ]
Commit c5cff8561d adds rcu grace period before freeing fib6_node. This
generates a new sparse warning on rt->rt6i_node related code:
net/ipv6/route.c:1394:30: error: incompatible types in comparison
expression (different address spaces)
./include/net/ip6_fib.h:187:14: error: incompatible types in comparison
expression (different address spaces)
This commit adds "__rcu" tag for rt6i_node and makes sure corresponding
rcu API is used for it.
After this fix, sparse no longer generates the above warning.
Fixes: c5cff8561d ("ipv6: add rcu grace period before freeing fib6_node")
Signed-off-by: Wei Wang <weiwan@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit c5cff8561d ]
We currently keep rt->rt6i_node pointing to the fib6_node for the route.
And some functions make use of this pointer to dereference the fib6_node
from rt structure, e.g. rt6_check(). However, as there is neither
refcount nor rcu taken when dereferencing rt->rt6i_node, it could
potentially cause crashes as rt->rt6i_node could be set to NULL by other
CPUs when doing a route deletion.
This patch introduces an rcu grace period before freeing fib6_node and
makes sure the functions that dereference it takes rcu_read_lock().
Note: there is no "Fixes" tag because this bug was there in a very
early stage.
Signed-off-by: Wei Wang <weiwan@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 591b6bb605 upstream.
Several legacy devices such as Geode-based Cisco ASA appliances
and DB800 development board do possess CS5536 IDE controller
with different PCI id than existing one. Using pata_generic is
not always feasible as at least DB800 requires MSR quirk from
pata_cs5536 to be used with vendor firmware.
Signed-off-by: Andrey Korolyov <andrey@xdel.ru>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit fbf1c41fc0 upstream.
Commit 0a94efb5ac ("workqueue: implicit ordered attribute should be
overridable") introduced a __WQ_ORDERED_EXPLICIT flag but gave it the
same value as __WQ_LEGACY. I don't believe these were intended to
mean the same thing, so renumber __WQ_ORDERED_EXPLICIT.
Fixes: 0a94efb5ac ("workqueue: implicit ordered attribute should be ...")
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit b339752d05 upstream.
When !NUMA, cpumask_of_node(@node) equals cpu_online_mask regardless of
@node. The assumption seems that if !NUMA, there shouldn't be more than
one node and thus reporting cpu_online_mask regardless of @node is
correct. However, that assumption was broken years ago to support
DISCONTIGMEM and whether a system has multiple nodes or not is
separately controlled by NEED_MULTIPLE_NODES.
This means that, on a system with !NUMA && NEED_MULTIPLE_NODES,
cpumask_of_node() will report cpu_online_mask for all possible nodes,
indicating that the CPUs are associated with multiple nodes which is an
impossible configuration.
This bug has been around forever but doesn't look like it has caused any
noticeable symptoms. However, it triggers a WARN recently added to
workqueue to verify NUMA affinity configuration.
Fix it by reporting empty cpumask on non-zero nodes if !NUMA.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-and-tested-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 0cc3b0ec23 upstream.
We have a MAX_LFS_FILESIZE macro that is meant to be filled in by
filesystems (and other IO targets) that know they are 64-bit clean and
don't have any 32-bit limits in their IO path.
It turns out that our 32-bit value for that limit was bogus. On 32-bit,
the VM layer is limited by the page cache to only 32-bit index values,
but our logic for that was confusing and actually wrong. We used to
define that value to
(((loff_t)PAGE_SIZE << (BITS_PER_LONG-1))-1)
which is actually odd in several ways: it limits the index to 31 bits,
and then it limits files so that they can't have data in that last byte
of a page that has the highest 31-bit index (ie page index 0x7fffffff).
Neither of those limitations make sense. The index is actually the full
32 bit unsigned value, and we can use that whole full page. So the
maximum size of the file would logically be "PAGE_SIZE << BITS_PER_LONG".
However, we do wan tto avoid the maximum index, because we have code
that iterates over the page indexes, and we don't want that code to
overflow. So the maximum size of a file on a 32-bit host should
actually be one page less than the full 32-bit index.
So the actual limit is ULONG_MAX << PAGE_SHIFT. That means that we will
not actually be using the page of that last index (ULONG_MAX), but we
can grow a file up to that limit.
The wrong value of MAX_LFS_FILESIZE actually caused problems for Doug
Nazar, who was still using a 32-bit host, but with a 9.7TB 2 x RAID5
volume. It turns out that our old MAX_LFS_FILESIZE was 8TiB (well, one
byte less), but the actual true VM limit is one page less than 16TiB.
This was invisible until commit c2a9737f45 ("vfs,mm: fix a dead loop
in truncate_inode_pages_range()"), which started applying that
MAX_LFS_FILESIZE limit to block devices too.
NOTE! On 64-bit, the page index isn't a limiter at all, and the limit is
actually just the offset type itself (loff_t), which is signed. But for
clarity, on 64-bit, just use the maximum signed value, and don't make
people have to count the number of 'f' characters in the hex constant.
So just use LLONG_MAX for the 64-bit case. That was what the value had
been before too, just written out as a hex constant.
Fixes: c2a9737f45 ("vfs,mm: fix a dead loop in truncate_inode_pages_range()")
Reported-and-tested-by: Doug Nazar <nazard@nazar.ca>
Cc: Andreas Dilger <adilger@dilger.ca>
Cc: Mark Fasheh <mfasheh@versity.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Dave Kleikamp <shaggy@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit dd86e373e0 upstream.
The package management code in RAPL relies on package mapping being
available before a CPU is started. This changed with:
9d85eb9119 ("x86/smpboot: Make logical package management more robust")
because the ACPI/BIOS information turned out to be unreliable, but that
left RAPL in broken state. This was not noticed because on a regular boot
all CPUs are online before RAPL is initialized.
A possible fix would be to reintroduce the mess which allocates a package
data structure in CPU prepare and when it turns out to already exist in
starting throw it away later in the CPU online callback. But that's a
horrible hack and not required at all because RAPL becomes functional for
perf only in the CPU online callback. That's correct because user space is
not yet informed about the CPU being onlined, so nothing caan rely on RAPL
being available on that particular CPU.
Move the allocation to the CPU online callback and simplify the hotplug
handling. At this point the package mapping is established and correct.
This also adds a missing check for available package data in the
event_init() function.
Reported-by: Yasuaki Ishimatsu <yasu.isimatu@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sebastian Siewior <bigeasy@linutronix.de>
Cc: Stephane Eranian <eranian@google.com>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Fixes: 9d85eb9119 ("x86/smpboot: Make logical package management more robust")
Link: http://lkml.kernel.org/r/20170131230141.212593966@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
[ jwang: backport to 4.9 fix Null pointer deref during hotplug cpu.]
Signed-off-by: Jack Wang <jinpu.wang@profitbricks.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit cb87481ee8 upstream.
The .data and .bss sections were modified in the generic linker script to
pull in sections named .data.<C identifier>, which are generated by gcc with
-ffunction-sections and -fdata-sections options.
The problem with this pattern is it can also match section names that Linux
defines explicitly, e.g., .data.unlikely. This can cause Linux sections to
get moved into the wrong place.
The way to avoid this is to use ".." separators for explicit section names
(the dot character is valid in a section name but not a C identifier).
However currently there are sections which don't follow this rule, so for
now just disable the wild card by default.
Example: http://marc.info/?l=linux-arm-kernel&m=150106824024221&w=2
Fixes: b67067f117 ("kbuild: allow archs to select link dead code/data elimination")
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 4cabc5b186 ]
Edward reported that there's an issue in min/max value bounds
tracking when signed and unsigned compares both provide hints
on limits when having unknown variables. E.g. a program such
as the following should have been rejected:
0: (7a) *(u64 *)(r10 -8) = 0
1: (bf) r2 = r10
2: (07) r2 += -8
3: (18) r1 = 0xffff8a94cda93400
5: (85) call bpf_map_lookup_elem#1
6: (15) if r0 == 0x0 goto pc+7
R0=map_value(ks=8,vs=8,id=0),min_value=0,max_value=0 R10=fp
7: (7a) *(u64 *)(r10 -16) = -8
8: (79) r1 = *(u64 *)(r10 -16)
9: (b7) r2 = -1
10: (2d) if r1 > r2 goto pc+3
R0=map_value(ks=8,vs=8,id=0),min_value=0,max_value=0 R1=inv,min_value=0
R2=imm-1,max_value=18446744073709551615,min_align=1 R10=fp
11: (65) if r1 s> 0x1 goto pc+2
R0=map_value(ks=8,vs=8,id=0),min_value=0,max_value=0 R1=inv,min_value=0,max_value=1
R2=imm-1,max_value=18446744073709551615,min_align=1 R10=fp
12: (0f) r0 += r1
13: (72) *(u8 *)(r0 +0) = 0
R0=map_value_adj(ks=8,vs=8,id=0),min_value=0,max_value=1 R1=inv,min_value=0,max_value=1
R2=imm-1,max_value=18446744073709551615,min_align=1 R10=fp
14: (b7) r0 = 0
15: (95) exit
What happens is that in the first part ...
8: (79) r1 = *(u64 *)(r10 -16)
9: (b7) r2 = -1
10: (2d) if r1 > r2 goto pc+3
... r1 carries an unsigned value, and is compared as unsigned
against a register carrying an immediate. Verifier deduces in
reg_set_min_max() that since the compare is unsigned and operation
is greater than (>), that in the fall-through/false case, r1's
minimum bound must be 0 and maximum bound must be r2. Latter is
larger than the bound and thus max value is reset back to being
'invalid' aka BPF_REGISTER_MAX_RANGE. Thus, r1 state is now
'R1=inv,min_value=0'. The subsequent test ...
11: (65) if r1 s> 0x1 goto pc+2
... is a signed compare of r1 with immediate value 1. Here,
verifier deduces in reg_set_min_max() that since the compare
is signed this time and operation is greater than (>), that
in the fall-through/false case, we can deduce that r1's maximum
bound must be 1, meaning with prior test, we result in r1 having
the following state: R1=inv,min_value=0,max_value=1. Given that
the actual value this holds is -8, the bounds are wrongly deduced.
When this is being added to r0 which holds the map_value(_adj)
type, then subsequent store access in above case will go through
check_mem_access() which invokes check_map_access_adj(), that
will then probe whether the map memory is in bounds based
on the min_value and max_value as well as access size since
the actual unknown value is min_value <= x <= max_value; commit
fce366a9dd ("bpf, verifier: fix alu ops against map_value{,
_adj} register types") provides some more explanation on the
semantics.
It's worth to note in this context that in the current code,
min_value and max_value tracking are used for two things, i)
dynamic map value access via check_map_access_adj() and since
commit 06c1c04972 ("bpf: allow helpers access to variable memory")
ii) also enforced at check_helper_mem_access() when passing a
memory address (pointer to packet, map value, stack) and length
pair to a helper and the length in this case is an unknown value
defining an access range through min_value/max_value in that
case. The min_value/max_value tracking is /not/ used in the
direct packet access case to track ranges. However, the issue
also affects case ii), for example, the following crafted program
based on the same principle must be rejected as well:
0: (b7) r2 = 0
1: (bf) r3 = r10
2: (07) r3 += -512
3: (7a) *(u64 *)(r10 -16) = -8
4: (79) r4 = *(u64 *)(r10 -16)
5: (b7) r6 = -1
6: (2d) if r4 > r6 goto pc+5
R1=ctx R2=imm0,min_value=0,max_value=0,min_align=2147483648 R3=fp-512
R4=inv,min_value=0 R6=imm-1,max_value=18446744073709551615,min_align=1 R10=fp
7: (65) if r4 s> 0x1 goto pc+4
R1=ctx R2=imm0,min_value=0,max_value=0,min_align=2147483648 R3=fp-512
R4=inv,min_value=0,max_value=1 R6=imm-1,max_value=18446744073709551615,min_align=1
R10=fp
8: (07) r4 += 1
9: (b7) r5 = 0
10: (6a) *(u16 *)(r10 -512) = 0
11: (85) call bpf_skb_load_bytes#26
12: (b7) r0 = 0
13: (95) exit
Meaning, while we initialize the max_value stack slot that the
verifier thinks we access in the [1,2] range, in reality we
pass -7 as length which is interpreted as u32 in the helper.
Thus, this issue is relevant also for the case of helper ranges.
Resetting both bounds in check_reg_overflow() in case only one
of them exceeds limits is also not enough as similar test can be
created that uses values which are within range, thus also here
learned min value in r1 is incorrect when mixed with later signed
test to create a range:
0: (7a) *(u64 *)(r10 -8) = 0
1: (bf) r2 = r10
2: (07) r2 += -8
3: (18) r1 = 0xffff880ad081fa00
5: (85) call bpf_map_lookup_elem#1
6: (15) if r0 == 0x0 goto pc+7
R0=map_value(ks=8,vs=8,id=0),min_value=0,max_value=0 R10=fp
7: (7a) *(u64 *)(r10 -16) = -8
8: (79) r1 = *(u64 *)(r10 -16)
9: (b7) r2 = 2
10: (3d) if r2 >= r1 goto pc+3
R0=map_value(ks=8,vs=8,id=0),min_value=0,max_value=0 R1=inv,min_value=3
R2=imm2,min_value=2,max_value=2,min_align=2 R10=fp
11: (65) if r1 s> 0x4 goto pc+2
R0=map_value(ks=8,vs=8,id=0),min_value=0,max_value=0
R1=inv,min_value=3,max_value=4 R2=imm2,min_value=2,max_value=2,min_align=2 R10=fp
12: (0f) r0 += r1
13: (72) *(u8 *)(r0 +0) = 0
R0=map_value_adj(ks=8,vs=8,id=0),min_value=3,max_value=4
R1=inv,min_value=3,max_value=4 R2=imm2,min_value=2,max_value=2,min_align=2 R10=fp
14: (b7) r0 = 0
15: (95) exit
This leaves us with two options for fixing this: i) to invalidate
all prior learned information once we switch signed context, ii)
to track min/max signed and unsigned boundaries separately as
done in [0]. (Given latter introduces major changes throughout
the whole verifier, it's rather net-next material, thus this
patch follows option i), meaning we can derive bounds either
from only signed tests or only unsigned tests.) There is still the
case of adjust_reg_min_max_vals(), where we adjust bounds on ALU
operations, meaning programs like the following where boundaries
on the reg get mixed in context later on when bounds are merged
on the dst reg must get rejected, too:
0: (7a) *(u64 *)(r10 -8) = 0
1: (bf) r2 = r10
2: (07) r2 += -8
3: (18) r1 = 0xffff89b2bf87ce00
5: (85) call bpf_map_lookup_elem#1
6: (15) if r0 == 0x0 goto pc+6
R0=map_value(ks=8,vs=8,id=0),min_value=0,max_value=0 R10=fp
7: (7a) *(u64 *)(r10 -16) = -8
8: (79) r1 = *(u64 *)(r10 -16)
9: (b7) r2 = 2
10: (3d) if r2 >= r1 goto pc+2
R0=map_value(ks=8,vs=8,id=0),min_value=0,max_value=0 R1=inv,min_value=3
R2=imm2,min_value=2,max_value=2,min_align=2 R10=fp
11: (b7) r7 = 1
12: (65) if r7 s> 0x0 goto pc+2
R0=map_value(ks=8,vs=8,id=0),min_value=0,max_value=0 R1=inv,min_value=3
R2=imm2,min_value=2,max_value=2,min_align=2 R7=imm1,max_value=0 R10=fp
13: (b7) r0 = 0
14: (95) exit
from 12 to 15: R0=map_value(ks=8,vs=8,id=0),min_value=0,max_value=0
R1=inv,min_value=3 R2=imm2,min_value=2,max_value=2,min_align=2 R7=imm1,min_value=1 R10=fp
15: (0f) r7 += r1
16: (65) if r7 s> 0x4 goto pc+2
R0=map_value(ks=8,vs=8,id=0),min_value=0,max_value=0 R1=inv,min_value=3
R2=imm2,min_value=2,max_value=2,min_align=2 R7=inv,min_value=4,max_value=4 R10=fp
17: (0f) r0 += r7
18: (72) *(u8 *)(r0 +0) = 0
R0=map_value_adj(ks=8,vs=8,id=0),min_value=4,max_value=4 R1=inv,min_value=3
R2=imm2,min_value=2,max_value=2,min_align=2 R7=inv,min_value=4,max_value=4 R10=fp
19: (b7) r0 = 0
20: (95) exit
Meaning, in adjust_reg_min_max_vals() we must also reset range
values on the dst when src/dst registers have mixed signed/
unsigned derived min/max value bounds with one unbounded value
as otherwise they can be added together deducing false boundaries.
Once both boundaries are established from either ALU ops or
compare operations w/o mixing signed/unsigned insns, then they
can safely be added to other regs also having both boundaries
established. Adding regs with one unbounded side to a map value
where the bounded side has been learned w/o mixing ops is
possible, but the resulting map value won't recover from that,
meaning such op is considered invalid on the time of actual
access. Invalid bounds are set on the dst reg in case i) src reg,
or ii) in case dst reg already had them. The only way to recover
would be to perform i) ALU ops but only 'add' is allowed on map
value types or ii) comparisons, but these are disallowed on
pointers in case they span a range. This is fine as only BPF_JEQ
and BPF_JNE may be performed on PTR_TO_MAP_VALUE_OR_NULL registers
which potentially turn them into PTR_TO_MAP_VALUE type depending
on the branch, so only here min/max value cannot be invalidated
for them.
In terms of state pruning, value_from_signed is considered
as well in states_equal() when dealing with adjusted map values.
With regards to breaking existing programs, there is a small
risk, but use-cases are rather quite narrow where this could
occur and mixing compares probably unlikely.
Joint work with Josef and Edward.
[0] https://lists.iovisor.org/pipermail/iovisor-dev/2017-June/000822.html
Fixes: 484611357c ("bpf: allow access into map value arrays")
Reported-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 68a66d149a ]
This important to call qdisc_tree_reduce_backlog() after changing queue
length. Parent qdisc should deactivate class in ->qlen_notify() called from
qdisc_tree_reduce_backlog() but this happens only if qdisc->q.qlen in zero.
Missed class deactivations leads to crashes/warnings at picking packets
from empty qdisc and corrupting state at reactivating this class in future.
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Fixes: 86a7996cc8 ("net_sched: introduce qdisc_replace() helper")
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit c780a049f9 ]
While working on yet another syzkaller report, I found
that our IP_MAX_MTU enforcements were not properly done.
gcc seems to reload dev->mtu for min(dev->mtu, IP_MAX_MTU), and
final result can be bigger than IP_MAX_MTU :/
This is a problem because device mtu can be changed on other cpus or
threads.
While this patch does not fix the issue I am working on, it is
probably worth addressing it.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 81fbfe8ada ]
As found by syzkaller, malicious users can set whatever tx_queue_len
on a tun device and eventually crash the kernel.
Lets remove the ALIGN(XXX, SMP_CACHE_BYTES) thing since a small
ring buffer is not fast anyway.
Fixes: 2e0ab8ca83 ("ptr_ring: array based FIFO for pointers")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit dd1c1f2f20 upstream.
This was reported many times, and this was even mentioned in commit
52ee2dfdd4 ("pids: refactor vnr/nr_ns helpers to make them safe") but
somehow nobody bothered to fix the obvious problem: task_tgid_nr_ns() is
not safe because task->group_leader points to nowhere after the exiting
task passes exit_notify(), rcu_read_lock() can not help.
We really need to change __unhash_process() to nullify group_leader,
parent, and real_parent, but this needs some cleanups. Until then we
can turn task_tgid_nr_ns() into another user of __task_pid_nr_ns() and
fix the problem.
Reported-by: Troy Kensinger <tkensinger@google.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 3010f87650 upstream.
There is existing use after free bug when deferred struct pages are
enabled:
The memblock_add() allocates memory for the memory array if more than
128 entries are needed. See comment in e820__memblock_setup():
* The bootstrap memblock region count maximum is 128 entries
* (INIT_MEMBLOCK_REGIONS), but EFI might pass us more E820 entries
* than that - so allow memblock resizing.
This memblock memory is freed here:
free_low_memory_core_early()
We access the freed memblock.memory later in boot when deferred pages
are initialized in this path:
deferred_init_memmap()
for_each_mem_pfn_range()
__next_mem_pfn_range()
type = &memblock.memory;
One possible explanation for why this use-after-free hasn't been hit
before is that the limit of INIT_MEMBLOCK_REGIONS has never been
exceeded at least on systems where deferred struct pages were enabled.
Tested by reducing INIT_MEMBLOCK_REGIONS down to 4 from the current 128,
and verifying in qemu that this code is getting excuted and that the
freed pages are sane.
Link: http://lkml.kernel.org/r/1502485554-318703-2-git-send-email-pasha.tatashin@oracle.com
Fixes: 7e18adb4f8 ("mm: meminit: initialise remaining struct pages in parallel with kswapd")
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Reviewed-by: Steven Sistare <steven.sistare@oracle.com>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Reviewed-by: Bob Picco <bob.picco@oracle.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 978d13d60c upstream.
This patch fixes a bug associated with iscsit_reset_np_thread()
that can occur during parallel configfs rmdir of a single iscsi_np
used across multiple iscsi-target instances, that would result in
hung task(s) similar to below where configfs rmdir process context
was blocked indefinately waiting for iscsi_np->np_restart_comp
to finish:
[ 6726.112076] INFO: task dcp_proxy_node_:15550 blocked for more than 120 seconds.
[ 6726.119440] Tainted: G W O 4.1.26-3321 #2
[ 6726.125045] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 6726.132927] dcp_proxy_node_ D ffff8803f202bc88 0 15550 1 0x00000000
[ 6726.140058] ffff8803f202bc88 ffff88085c64d960 ffff88083b3b1ad0 ffff88087fffeb08
[ 6726.147593] ffff8803f202c000 7fffffffffffffff ffff88083f459c28 ffff88083b3b1ad0
[ 6726.155132] ffff88035373c100 ffff8803f202bca8 ffffffff8168ced2 ffff8803f202bcb8
[ 6726.162667] Call Trace:
[ 6726.165150] [<ffffffff8168ced2>] schedule+0x32/0x80
[ 6726.170156] [<ffffffff8168f5b4>] schedule_timeout+0x214/0x290
[ 6726.176030] [<ffffffff810caef2>] ? __send_signal+0x52/0x4a0
[ 6726.181728] [<ffffffff8168d7d6>] wait_for_completion+0x96/0x100
[ 6726.187774] [<ffffffff810e7c80>] ? wake_up_state+0x10/0x10
[ 6726.193395] [<ffffffffa035d6e2>] iscsit_reset_np_thread+0x62/0xe0 [iscsi_target_mod]
[ 6726.201278] [<ffffffffa0355d86>] iscsit_tpg_disable_portal_group+0x96/0x190 [iscsi_target_mod]
[ 6726.210033] [<ffffffffa0363f7f>] lio_target_tpg_store_enable+0x4f/0xc0 [iscsi_target_mod]
[ 6726.218351] [<ffffffff81260c5a>] configfs_write_file+0xaa/0x110
[ 6726.224392] [<ffffffff811ea364>] vfs_write+0xa4/0x1b0
[ 6726.229576] [<ffffffff811eb111>] SyS_write+0x41/0xb0
[ 6726.234659] [<ffffffff8169042e>] system_call_fastpath+0x12/0x71
It would happen because each iscsit_reset_np_thread() sets state
to ISCSI_NP_THREAD_RESET, sends SIGINT, and then blocks waiting
for completion on iscsi_np->np_restart_comp.
However, if iscsi_np was active processing a login request and
more than a single iscsit_reset_np_thread() caller to the same
iscsi_np was blocked on iscsi_np->np_restart_comp, iscsi_np
kthread process context in __iscsi_target_login_thread() would
flush pending signals and only perform a single completion of
np->np_restart_comp before going back to sleep within transport
specific iscsit_transport->iscsi_accept_np code.
To address this bug, add a iscsi_np->np_reset_count and update
__iscsi_target_login_thread() to keep completing np->np_restart_comp
until ->np_reset_count has reached zero.
Reported-by: Gary Guo <ghg@datera.io>
Tested-by: Gary Guo <ghg@datera.io>
Cc: Mike Christie <mchristi@redhat.com>
Cc: Hannes Reinecke <hare@suse.de>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 0a94efb5ac upstream.
5c0338c687 ("workqueue: restore WQ_UNBOUND/max_active==1 to be
ordered") automatically enabled ordered attribute for unbound
workqueues w/ max_active == 1. Because ordered workqueues reject
max_active and some attribute changes, this implicit ordered mode
broke cases where the user creates an unbound workqueue w/ max_active
== 1 and later explicitly changes the related attributes.
This patch distinguishes explicit and implicit ordered setting and
overrides from attribute changes if implict.
Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: 5c0338c687 ("workqueue: restore WQ_UNBOUND/max_active==1 to be ordered")
Cc: Holger Hoffstätte <holger@applied-asynchrony.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 2d39b3cd34 ]
Since commit 00cd5c37af ("ptrace: permit ptracing of /sbin/init") we
can now trace init processes. init is initially protected with
SIGNAL_UNKILLABLE which will prevent fatal signals such as SIGSTOP, but
there are a number of paths during tracing where SIGNAL_UNKILLABLE can
be implicitly cleared.
This can result in init becoming stoppable/killable after tracing. For
example, running:
while true; do kill -STOP 1; done &
strace -p 1
and then stopping strace and the kill loop will result in init being
left in state TASK_STOPPED. Sending SIGCONT to init will resume it, but
init will now respond to future SIGSTOP signals rather than ignoring
them.
Make sure that when setting SIGNAL_STOP_CONTINUED/SIGNAL_STOP_STOPPED
that we don't clear SIGNAL_UNKILLABLE.
Link: http://lkml.kernel.org/r/20170104122017.25047-1-jamie.iles@oracle.com
Signed-off-by: Jamie Iles <jamie.iles@oracle.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit bb1107f7c6 ]
Andrey Konovalov has reported the following warning triggered by the
syzkaller fuzzer.
WARNING: CPU: 1 PID: 9935 at mm/page_alloc.c:3511 __alloc_pages_nodemask+0x159c/0x1e20
Kernel panic - not syncing: panic_on_warn set ...
CPU: 1 PID: 9935 Comm: syz-executor0 Not tainted 4.9.0-rc7+ #34
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
Call Trace:
__alloc_pages_slowpath mm/page_alloc.c:3511
__alloc_pages_nodemask+0x159c/0x1e20 mm/page_alloc.c:3781
alloc_pages_current+0x1c7/0x6b0 mm/mempolicy.c:2072
alloc_pages include/linux/gfp.h:469
kmalloc_order+0x1f/0x70 mm/slab_common.c:1015
kmalloc_order_trace+0x1f/0x160 mm/slab_common.c:1026
kmalloc_large include/linux/slab.h:422
__kmalloc+0x210/0x2d0 mm/slub.c:3723
kmalloc include/linux/slab.h:495
ep_write_iter+0x167/0xb50 drivers/usb/gadget/legacy/inode.c:664
new_sync_write fs/read_write.c:499
__vfs_write+0x483/0x760 fs/read_write.c:512
vfs_write+0x170/0x4e0 fs/read_write.c:560
SYSC_write fs/read_write.c:607
SyS_write+0xfb/0x230 fs/read_write.c:599
entry_SYSCALL_64_fastpath+0x1f/0xc2
The issue is caused by a lack of size check for the request size in
ep_write_iter which should be fixed. It, however, points to another
problem, that SLUB defines KMALLOC_MAX_SIZE too large because the its
KMALLOC_SHIFT_MAX is (MAX_ORDER + PAGE_SHIFT) which means that the
resulting page allocator request might be MAX_ORDER which is too large
(see __alloc_pages_slowpath).
The same applies to the SLOB allocator which allows even larger sizes.
Make sure that they are capped properly and never request more than
MAX_ORDER order.
Link: http://lkml.kernel.org/r/20161220130659.16461-2-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reported-by: Andrey Konovalov <andreyknvl@google.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 93be2b7427 upstream.
gcc-7 complains that wl3501_cs passes NULL into a function that
then uses the argument as the input for memcpy:
drivers/net/wireless/wl3501_cs.c: In function 'wl3501_get_scan':
include/net/iw_handler.h:559:3: error: argument 2 null where non-null expected [-Werror=nonnull]
memcpy(stream + point_len, extra, iwe->u.data.length);
This works fine here because iwe->u.data.length is guaranteed to be 0
and the memcpy doesn't actually have an effect.
Making the length check explicit avoids the warning and should have
no other effect here.
Also check the pointer itself, since otherwise we get warnings
elsewhere in the code.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 6b84202c94 ]
Commit b1f5bfc27a ("sctp: don't dereference ptr before leaving
_sctp_walk_{params, errors}()") tried to fix the issue that it
may overstep the chunk end for _sctp_walk_{params, errors} with
'chunk_end > offset(length) + sizeof(length)'.
But it introduced a side effect: When processing INIT, it verifies
the chunks with 'param.v == chunk_end' after iterating all params
by sctp_walk_params(). With the check 'chunk_end > offset(length)
+ sizeof(length)', it would return when the last param is not yet
accessed. Because the last param usually is fwdtsn supported param
whose size is 4 and 'chunk_end == offset(length) + sizeof(length)'
This is a badly issue even causing sctp couldn't process 4-shakes.
Client would always get abort when connecting to server, due to
the failure of INIT chunk verification on server.
The patch is to use 'chunk_end <= offset(length) + sizeof(length)'
instead of 'chunk_end < offset(length) + sizeof(length)' for both
_sctp_walk_params and _sctp_walk_errors.
Fixes: b1f5bfc27a ("sctp: don't dereference ptr before leaving _sctp_walk_{params, errors}()")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 25cdda95fd upstream.
This patch fixes a OOPs originally introduced by:
commit bb048357da
Author: Nicholas Bellinger <nab@linux-iscsi.org>
Date: Thu Sep 5 14:54:04 2013 -0700
iscsi-target: Add sk->sk_state_change to cleanup after TCP failure
which would trigger a NULL pointer dereference when a TCP connection
was closed asynchronously via iscsi_target_sk_state_change(), but only
when the initial PDU processing in iscsi_target_do_login() from iscsi_np
process context was blocked waiting for backend I/O to complete.
To address this issue, this patch makes the following changes.
First, it introduces some common helper functions used for checking
socket closing state, checking login_flags, and atomically checking
socket closing state + setting login_flags.
Second, it introduces a LOGIN_FLAGS_INITIAL_PDU bit to know when a TCP
connection has dropped via iscsi_target_sk_state_change(), but the
initial PDU processing within iscsi_target_do_login() in iscsi_np
context is still running. For this case, it sets LOGIN_FLAGS_CLOSED,
but doesn't invoke schedule_delayed_work().
The original NULL pointer dereference case reported by MNC is now handled
by iscsi_target_do_login() doing a iscsi_target_sk_check_close() before
transitioning to FFP to determine when the socket has already closed,
or iscsi_target_start_negotiation() if the login needs to exchange
more PDUs (eg: iscsi_target_do_login returned 0) but the socket has
closed. For both of these cases, the cleanup up of remaining connection
resources will occur in iscsi_target_start_negotiation() from iscsi_np
process context once the failure is detected.
Finally, to handle to case where iscsi_target_sk_state_change() is
called after the initial PDU procesing is complete, it now invokes
conn->login_work -> iscsi_target_do_login_rx() to perform cleanup once
existing iscsi_target_sk_check_close() checks detect connection failure.
For this case, the cleanup of remaining connection resources will occur
in iscsi_target_do_login_rx() from delayed workqueue process context
once the failure is detected.
Reported-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Mike Christie <mchristi@redhat.com>
Tested-by: Mike Christie <mchristi@redhat.com>
Cc: Mike Christie <mchristi@redhat.com>
Reported-by: Hannes Reinecke <hare@suse.com>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Varun Prakash <varun@chelsio.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 89affbf5d9 upstream.
In codepaths that use the begin/retry interface for reading
mems_allowed_seq with irqs disabled, there exists a race condition that
stalls the patch process after only modifying a subset of the
static_branch call sites.
This problem manifested itself as a deadlock in the slub allocator,
inside get_any_partial. The loop reads mems_allowed_seq value (via
read_mems_allowed_begin), performs the defrag operation, and then
verifies the consistency of mem_allowed via the read_mems_allowed_retry
and the cookie returned by xxx_begin.
The issue here is that both begin and retry first check if cpusets are
enabled via cpusets_enabled() static branch. This branch can be
rewritted dynamically (via cpuset_inc) if a new cpuset is created. The
x86 jump label code fully synchronizes across all CPUs for every entry
it rewrites. If it rewrites only one of the callsites (specifically the
one in read_mems_allowed_retry) and then waits for the
smp_call_function(do_sync_core) to complete while a CPU is inside the
begin/retry section with IRQs off and the mems_allowed value is changed,
we can hang.
This is because begin() will always return 0 (since it wasn't patched
yet) while retry() will test the 0 against the actual value of the seq
counter.
The fix is to use two different static keys: one for begin
(pre_enable_key) and one for retry (enable_key). In cpuset_inc(), we
first bump the pre_enable key to ensure that cpuset_mems_allowed_begin()
always return a valid seqcount if are enabling cpusets. Similarly, when
disabling cpusets via cpuset_dec(), we first ensure that callers of
cpuset_mems_allowed_retry() will start ignoring the seqcount value
before we let cpuset_mems_allowed_begin() return 0.
The relevant stack traces of the two stuck threads:
CPU: 1 PID: 1415 Comm: mkdir Tainted: G L 4.9.36-00104-g540c51286237 #4
Hardware name: Default string Default string/Hardware, BIOS 4.29.1-20170526215256 05/26/2017
task: ffff8817f9c28000 task.stack: ffffc9000ffa4000
RIP: smp_call_function_many+0x1f9/0x260
Call Trace:
smp_call_function+0x3b/0x70
on_each_cpu+0x2f/0x90
text_poke_bp+0x87/0xd0
arch_jump_label_transform+0x93/0x100
__jump_label_update+0x77/0x90
jump_label_update+0xaa/0xc0
static_key_slow_inc+0x9e/0xb0
cpuset_css_online+0x70/0x2e0
online_css+0x2c/0xa0
cgroup_apply_control_enable+0x27f/0x3d0
cgroup_mkdir+0x2b7/0x420
kernfs_iop_mkdir+0x5a/0x80
vfs_mkdir+0xf6/0x1a0
SyS_mkdir+0xb7/0xe0
entry_SYSCALL_64_fastpath+0x18/0xad
...
CPU: 2 PID: 1 Comm: init Tainted: G L 4.9.36-00104-g540c51286237 #4
Hardware name: Default string Default string/Hardware, BIOS 4.29.1-20170526215256 05/26/2017
task: ffff8818087c0000 task.stack: ffffc90000030000
RIP: int3+0x39/0x70
Call Trace:
<#DB> ? ___slab_alloc+0x28b/0x5a0
<EOE> ? copy_process.part.40+0xf7/0x1de0
__slab_alloc.isra.80+0x54/0x90
copy_process.part.40+0xf7/0x1de0
copy_process.part.40+0xf7/0x1de0
kmem_cache_alloc_node+0x8a/0x280
copy_process.part.40+0xf7/0x1de0
_do_fork+0xe7/0x6c0
_raw_spin_unlock_irq+0x2d/0x60
trace_hardirqs_on_caller+0x136/0x1d0
entry_SYSCALL_64_fastpath+0x5/0xad
do_syscall_64+0x27/0x350
SyS_clone+0x19/0x20
do_syscall_64+0x60/0x350
entry_SYSCALL64_slow_path+0x25/0x25
Link: http://lkml.kernel.org/r/20170731040113.14197-1-dmitriyz@waymo.com
Fixes: 46e700abc4 ("mm, page_alloc: remove unnecessary taking of a seqlock when cpusets are disabled")
Signed-off-by: Dima Zavin <dmitriyz@waymo.com>
Reported-by: Cliff Spradlin <cspradlin@waymo.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Christopher Lameter <cl@linux.com>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 3ea277194d upstream.
Nadav Amit identified a theoritical race between page reclaim and
mprotect due to TLB flushes being batched outside of the PTL being held.
He described the race as follows:
CPU0 CPU1
---- ----
user accesses memory using RW PTE
[PTE now cached in TLB]
try_to_unmap_one()
==> ptep_get_and_clear()
==> set_tlb_ubc_flush_pending()
mprotect(addr, PROT_READ)
==> change_pte_range()
==> [ PTE non-present - no flush ]
user writes using cached RW PTE
...
try_to_unmap_flush()
The same type of race exists for reads when protecting for PROT_NONE and
also exists for operations that can leave an old TLB entry behind such
as munmap, mremap and madvise.
For some operations like mprotect, it's not necessarily a data integrity
issue but it is a correctness issue as there is a window where an
mprotect that limits access still allows access. For munmap, it's
potentially a data integrity issue although the race is massive as an
munmap, mmap and return to userspace must all complete between the
window when reclaim drops the PTL and flushes the TLB. However, it's
theoritically possible so handle this issue by flushing the mm if
reclaim is potentially currently batching TLB flushes.
Other instances where a flush is required for a present pte should be ok
as either the page lock is held preventing parallel reclaim or a page
reference count is elevated preventing a parallel free leading to
corruption. In the case of page_mkclean there isn't an obvious path
that userspace could take advantage of without using the operations that
are guarded by this patch. Other users such as gup as a race with
reclaim looks just at PTEs. huge page variants should be ok as they
don't race with reclaim. mincore only looks at PTEs. userfault also
should be ok as if a parallel reclaim takes place, it will either fault
the page back in or read some of the data before the flush occurs
triggering a fault.
Note that a variant of this patch was acked by Andy Lutomirski but this
was for the x86 parts on top of his PCID work which didn't make the 4.13
merge window as expected. His ack is dropped from this version and
there will be a follow-on patch on top of PCID that will include his
ack.
[akpm@linux-foundation.org: tweak comments]
[akpm@linux-foundation.org: fix spello]
Link: http://lkml.kernel.org/r/20170717155523.emckq2esjro6hf3z@suse.de
Reported-by: Nadav Amit <nadav.amit@gmail.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit fd40559c86 upstream.
The verifier is allocated on the stack, but the EXCHANGE_ID RPC call was
changed to be asynchronous by commit 8d89bd70bc. If we interrrupt
the call to rpc_wait_for_completion_task(), we can therefore end up
transmitting random stack contents in lieu of the verifier.
Fixes: 8d89bd70bc ("NFS setup async exchange_id")
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 10b1c04e92 ]
Demoting simple flow steering rule priority (for DPDK) was achieved by
wrapping FW commands MLX4_QP_FLOW_STEERING_ATTACH/DETACH for the PF
as well, and forcing the priority to MLX4_DOMAIN_NIC in the wrapper
function for the PF and all VFs.
In function mlx4_ib_create_flow(), this change caused the main rule
creation for the PF to be wrapped, while it left the associated
tunnel steering rule creation unwrapped for the PF.
This mismatch caused rule deletion failures in mlx4_ib_destroy_flow()
for the PF when the detach wrapper function did not find the associated
tunnel-steering rule (since creation of that rule for the PF did not
go through the wrapper function).
Fix this by setting MLX4_QP_FLOW_STEERING_ATTACH/DETACH to be "native"
(so that the PF invocation does not go through the wrapper), and perform
the required priority demotion for the PF in the mlx4_ib_create_flow()
code path.
Fixes: 48564135cb ("net/mlx4_core: Demote simple multicast and broadcast flow steering rules")
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 663deb4788 upstream.
In preparation of not locking at all for certain buffers depending on if
there's contention, make locking optional depending on the initialization
of the prz.
Signed-off-by: Joel Fernandes <joelaf@google.com>
[kees: moved locking flag into prz instead of via caller arguments]
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 49d31c2f38 upstream.
take_dentry_name_snapshot() takes a safe snapshot of dentry name;
if the name is a short one, it gets copied into caller-supplied
structure, otherwise an extra reference to external name is grabbed
(those are never modified). In either case the pointer to stable
string is stored into the same structure.
dentry must be held by the caller of take_dentry_name_snapshot(),
but may be freely dropped afterwards - the snapshot will stay
until destroyed by release_dentry_name_snapshot().
Intended use:
struct name_snapshot s;
take_dentry_name_snapshot(&s, dentry);
...
access s.name
...
release_dentry_name_snapshot(&s);
Replaces fsnotify_oldname_...(), gets used in fsnotify to obtain the name
to pass down with event.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 0878fff1f4 upstream.
The Generic PHY driver is a catch-all PHY driver and it should preserve
whatever prior initialization has been done by boot loader or firmware
agents. For specific PHY device configuration it is expected that a
specialized PHY driver would take over that role.
Resetting the generic PHY was a bad idea that has lead to several
complaints and downstream workarounds e.g: in OpenWrt/LEDE so restore
the behavior prior to 87aa9f9c61 ("net: phy: consolidate PHY
reset in phy_init_hw()").
Reported-by: Felix Fietkau <nbd@nbd.name>
Fixes: 87aa9f9c61 ("net: phy: consolidate PHY reset in phy_init_hw()")
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Amit Pundir <amit.pundir@linaro.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 109704492e upstream.
Currently pstore has a global spinlock for all zones. Since the zones
are independent and modify different areas of memory, there's no need
to have a global lock, so we should use a per-zone lock as introduced
here. Also, when ramoops's ftrace use-case has a FTRACE_PER_CPU flag
introduced later, which splits the ftrace memory area into a single zone
per CPU, it will eliminate the need for locking. In preparation for this,
make the locking optional.
Signed-off-by: Joel Fernandes <joelaf@google.com>
[kees: updated commit message]
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Leo Yan <leo.yan@linaro.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 5d6dee80a1 upstream.
At the point where the kvm-vfio pseudo device wants to release its
vfio group reference, we can't always acquire a new reference to make
that happen. The group can be in a state where we wouldn't allow a
new reference to be added. This new helper function allows a caller
to match a file to a group to facilitate this. Given a file and
group, report if they match. Thus the caller needs to already have a
group reference to match to the file. This allows the deletion of a
group without acquiring a new reference.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Tested-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 138d351eef upstream.
This patch re-introduces part of a long standing login workaround that
was recently dropped by:
commit 1c99de981f
Author: Nicholas Bellinger <nab@linux-iscsi.org>
Date: Sun Apr 2 13:36:44 2017 -0700
iscsi-target: Drop work-around for legacy GlobalSAN initiator
Namely, the workaround for FirstBurstLength ended up being required by
Mellanox Flexboot PXE boot ROMs as reported by Robert.
So this patch re-adds the work-around for FirstBurstLength within
iscsi_check_proposer_for_optional_reply(), and makes the key optional
to respond when the initiator does not propose, nor respond to it.
Also as requested by Arun, this patch introduces a new TPG attribute
named 'login_keys_workaround' that controls the use of both the
FirstBurstLength workaround, as well as the two other existing
workarounds for gPXE iSCSI boot client.
By default, the workaround is enabled with login_keys_workaround=1,
since Mellanox FlexBoot requires it, and Arun has verified the Qlogic
MSFT initiator already proposes FirstBurstLength, so it's uneffected
by this re-adding this part of the original work-around.
Reported-by: Robert LeBlanc <robert@leblancnet.us>
Cc: Robert LeBlanc <robert@leblancnet.us>
Reviewed-by: Arun Easi <arun.easi@cavium.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit f9279c968c upstream.
The addition of the STARGET_REMOVE state had the side effect of
introducing a race condition that can cause a crash.
scsi_target_reap_ref_release() checks the starget->state to
see if it still in STARGET_CREATED, and if so, skips calling
transport_remove_device() and device_del(), because the starget->state
is only set to STARGET_RUNNING after scsi_target_add() has called
device_add() and transport_add_device().
However, if an rport loss occurs while a target is being scanned,
it can happen that scsi_remove_target() will be called while the
starget is still in the STARGET_CREATED state. In this case, the
starget->state will be set to STARGET_REMOVE, and as a result,
scsi_target_reap_ref_release() will take the wrong path. The end
result is a panic:
[ 1255.356653] Oops: 0000 [#1] SMP
[ 1255.360154] Modules linked in: x86_pkg_temp_thermal kvm_intel kvm irqbypass crc32c_intel ghash_clmulni_i
[ 1255.393234] CPU: 5 PID: 149 Comm: kworker/u96:4 Tainted: G W 4.11.0+ #8
[ 1255.401879] Hardware name: Dell Inc. PowerEdge R320/08VT7V, BIOS 2.0.22 11/19/2013
[ 1255.410327] Workqueue: scsi_wq_6 fc_scsi_scan_rport [scsi_transport_fc]
[ 1255.417720] task: ffff88060ca8c8c0 task.stack: ffffc900048a8000
[ 1255.424331] RIP: 0010:kernfs_find_ns+0x13/0xc0
[ 1255.429287] RSP: 0018:ffffc900048abbf0 EFLAGS: 00010246
[ 1255.435123] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
[ 1255.443083] RDX: 0000000000000000 RSI: ffffffff8188d659 RDI: 0000000000000000
[ 1255.451043] RBP: ffffc900048abc10 R08: 0000000000000000 R09: 0000012433fe0025
[ 1255.459005] R10: 0000000025e5a4b5 R11: 0000000025e5a4b5 R12: ffffffff8188d659
[ 1255.466972] R13: 0000000000000000 R14: ffff8805f55e5088 R15: 0000000000000000
[ 1255.474931] FS: 0000000000000000(0000) GS:ffff880616b40000(0000) knlGS:0000000000000000
[ 1255.483959] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1255.490370] CR2: 0000000000000068 CR3: 0000000001c09000 CR4: 00000000000406e0
[ 1255.498332] Call Trace:
[ 1255.501058] kernfs_find_and_get_ns+0x31/0x60
[ 1255.505916] sysfs_unmerge_group+0x1d/0x60
[ 1255.510498] dpm_sysfs_remove+0x22/0x60
[ 1255.514783] device_del+0xf4/0x2e0
[ 1255.518577] ? device_remove_file+0x19/0x20
[ 1255.523241] attribute_container_class_device_del+0x1a/0x20
[ 1255.529457] transport_remove_classdev+0x4e/0x60
[ 1255.534607] ? transport_add_class_device+0x40/0x40
[ 1255.540046] attribute_container_device_trigger+0xb0/0xc0
[ 1255.546069] transport_remove_device+0x15/0x20
[ 1255.551025] scsi_target_reap_ref_release+0x25/0x40
[ 1255.556467] scsi_target_reap+0x2e/0x40
[ 1255.560744] __scsi_scan_target+0xaa/0x5b0
[ 1255.565312] scsi_scan_target+0xec/0x100
[ 1255.569689] fc_scsi_scan_rport+0xb1/0xc0 [scsi_transport_fc]
[ 1255.576099] process_one_work+0x14b/0x390
[ 1255.580569] worker_thread+0x4b/0x390
[ 1255.584651] kthread+0x109/0x140
[ 1255.588251] ? rescuer_thread+0x330/0x330
[ 1255.592730] ? kthread_park+0x60/0x60
[ 1255.596815] ret_from_fork+0x29/0x40
[ 1255.600801] Code: 24 08 48 83 42 40 01 5b 41 5c 5d c3 66 66 66 2e 0f 1f 84 00 00 00 00 00 66 66 66 66 90
[ 1255.621876] RIP: kernfs_find_ns+0x13/0xc0 RSP: ffffc900048abbf0
[ 1255.628479] CR2: 0000000000000068
[ 1255.632756] ---[ end trace 34a69ba0477d036f ]---
Fix this by adding another scsi_target state STARGET_CREATED_REMOVE
to distinguish this case.
Fixes: f05795d3d7 ("scsi: Add intermediate STARGET_REMOVE state to scsi_target_state")
Reported-by: David Jeffery <djeffery@redhat.com>
Signed-off-by: Ewan D. Milne <emilne@redhat.com>
Reviewed-by: Laurence Oberman <loberman@redhat.com>
Tested-by: Laurence Oberman <loberman@redhat.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>