[ Upstream commit 3499ba8198 ]
For a while, event channel notification via the PCI platform device
has been broken, because we attempt to communicate with xenstore before
we even have notifications working, with the xs_reset_watches() call
in xs_init().
We tend to get away with this on Xen versions below 4.0 because we avoid
calling xs_reset_watches() anyway, because xenstore might not cope with
reading a non-existent key. And newer Xen *does* have the vector
callback support, so we rarely fall back to INTX/GSI delivery.
To fix it, clean up a bit of the mess of xs_init() and xenbus_probe()
startup. Call xs_init() directly from xenbus_init() only in the !XS_HVM
case, deferring it to be called from xenbus_probe() in the XS_HVM case
instead.
Then fix up the invocation of xenbus_probe() to happen either from its
device_initcall if the callback is available early enough, or when the
callback is finally set up. This means that the hack of calling
xenbus_probe() from a workqueue after the first interrupt, or directly
from the PCI platform device setup, is no longer needed.
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Link: https://lore.kernel.org/r/20210113132606.422794-2-dwmw2@infradead.org
Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit c35a824c31 ]
With UBSAN enabled and building with clang, there are occasionally
warnings like
WARNING: modpost: vmlinux.o(.text+0xc533ec): Section mismatch in reference from the function arch_atomic64_or() to the variable .init.data:numa_nodes_parsed
The function arch_atomic64_or() references
the variable __initdata numa_nodes_parsed.
This is often because arch_atomic64_or lacks a __initdata
annotation or the annotation of numa_nodes_parsed is wrong.
for functions that end up not being inlined as intended but operating
on __initdata variables. Mark these as __always_inline, along with
the corresponding asm-generic wrappers.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20210108092024.4034860-1-arnd@kernel.org
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit 5eee7bd7e2 upstream.
This worked before, because we made all callers name their next pointer
"next". But in trying to be more "drop-in" ready, the silliness here is
revealed. This commit fixes the problem by making the macro argument and
the member use different names.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit dcfea72e79 upstream.
As part of the continual effort to remove direct usage of skb->next and
skb->prev, this patch adds a helper for iterating through the
singly-linked variant of skb lists, which are used for lists of GSO
packet. The name "skb_list_..." has been chosen to match the existing
function, "kfree_skb_list, which also operates on these singly-linked
lists, and the "..._walk_safe" part is the same idiom as elsewhere in
the kernel.
This patch removes the helper from wireguard and puts it into
linux/skbuff.h, while making it a bit more robust for general usage. In
particular, parenthesis are added around the macro argument usage, and it
now accounts for trying to iterate through an already-null skb pointer,
which will simply run the iteration zero times. This latter enhancement
means it can be used to replace both do { ... } while and while (...)
open-coded idioms.
This should take care of these three possible usages, which match all
current methods of iterations.
skb_list_walk_safe(segs, skb, next) { ... }
skb_list_walk_safe(skb, skb, next) { ... }
skb_list_walk_safe(segs, skb, segs) { ... }
Gcc appears to generate efficient code for each of these.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
[ Just the skbuff.h changes for backporting - gregkh]
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit ee61cfd955 ]
It adds a stub acpi_create_platform_device() for !CONFIG_ACPI build, so
that caller doesn't have to deal with !CONFIG_ACPI build issue.
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Shawn Guo <shawn.guo@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 9b5948267a ]
With external metadata device, flush requests are not passed down to the
data device.
Fix this by submitting the flush request in dm_integrity_flush_buffers. In
order to not degrade performance, we overlap the data device flush with
the metadata device flush.
Reported-by: Lukas Straub <lukasstraub2@web.de>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit eff8728fe6 upstream.
Basically, consider .text.{hot|unlikely|unknown}.* part of .text, too.
When compiling with profiling information (collected via PGO
instrumentations or AutoFDO sampling), Clang will separate code into
.text.hot, .text.unlikely, or .text.unknown sections based on profiling
information. After D79600 (clang-11), these sections will have a
trailing `.` suffix, ie. .text.hot., .text.unlikely., .text.unknown..
When using -ffunction-sections together with profiling infomation,
either explicitly (FGKASLR) or implicitly (LTO), code may be placed in
sections following the convention:
.text.hot.<foo>, .text.unlikely.<bar>, .text.unknown.<baz>
where <foo>, <bar>, and <baz> are functions. (This produces one section
per function; we generally try to merge these all back via linker script
so that we don't have 50k sections).
For the above cases, we need to teach our linker scripts that such
sections might exist and that we'd explicitly like them grouped
together, otherwise we can wind up with code outside of the
_stext/_etext boundaries that might not be mapped properly for some
architectures, resulting in boot failures.
If the linker script is not told about possible input sections, then
where the section is placed as output is a heuristic-laiden mess that's
non-portable between linkers (ie. BFD and LLD), and has resulted in many
hard to debug bugs. Kees Cook is working on cleaning this up by adding
--orphan-handling=warn linker flag used in ARCH=powerpc to additional
architectures. In the case of linker scripts, borrowing from the Zen of
Python: explicit is better than implicit.
Also, ld.bfd's internal linker script considers .text.hot AND
.text.hot.* to be part of .text, as well as .text.unlikely and
.text.unlikely.*. I didn't see support for .text.unknown.*, and didn't
see Clang producing such code in our kernel builds, but I see code in
LLVM that can produce such section names if profiling information is
missing. That may point to a larger issue with generating or collecting
profiles, but I would much rather be safe and explicit than have to
debug yet another issue related to orphan section placement.
Reported-by: Jian Cai <jiancai@google.com>
Suggested-by: Fāng-ruì Sòng <maskray@google.com>
Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Luis Lozano <llozano@google.com>
Tested-by: Manoj Gupta <manojgupta@google.com>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: linux-arch@vger.kernel.org
Cc: stable@vger.kernel.org
Link: https://sourceware.org/git/?p=binutils-gdb.git;a=commitdiff;h=add44f8d5c5c05e08b11e033127a744d61c26aee
Link: https://sourceware.org/git/?p=binutils-gdb.git;a=commitdiff;h=1de778ed23ce7492c523d5850c6c6dbb34152655
Link: https://reviews.llvm.org/D79600
Link: https://bugs.chromium.org/p/chromium/issues/detail?id=1084760
Link: https://lore.kernel.org/r/20200821194310.3089815-7-keescook@chromium.org
Debugged-by: Luis Lozano <llozano@google.com>
[nc: Resolve small conflict due to lack of NOINSTR_TEXT]
Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 7d6e132965 ]
The following functional changes were needed for backport:
- vfio_iommu_type1_get_info doesn't exist, call
vfio_iommu_dma_avail_build_caps from vfio_iommu_type1_ioctl.
- As further fallout from this, vfio_iommu_dma_avail_build_caps must
acquire and release the iommu mutex lock. To do so, the return value is
stored in a local variable as in vfio_iommu_iova_build_caps.
Upstream commit description:
Commit 492855939b ("vfio/type1: Limit DMA mappings per container")
added the ability to limit the number of memory backed DMA mappings.
However on s390x, when lazy mapping is in use, we use a very large
number of concurrent mappings. Let's provide the current allowable
number of DMA mappings to userspace via the IOMMU info chain so that
userspace can take appropriate mitigation.
Signed-off-by: Matthew Rosato <mjrosato@linux.ibm.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit c6c75deda8 ]
Commit 1fde6f21d9 ("proc: fix /proc/net/* after setns(2)") only forced
revalidation of regular files under /proc/net/
However, /proc/net/ is unusual in the sense of /proc/net/foo handlers
take netns pointer from parent directory which is old netns.
Steps to reproduce:
(void)open("/proc/net/sctp/snmp", O_RDONLY);
unshare(CLONE_NEWNET);
int fd = open("/proc/net/sctp/snmp", O_RDONLY);
read(fd, &c, 1);
Read will read wrong data from original netns.
Patch forces lookup on every directory under /proc/net .
Link: https://lkml.kernel.org/r/20201205160916.GA109739@localhost.localdomain
Fixes: 1da4d377f9 ("proc: revalidate misc dentries")
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reported-by: "Rantala, Tommi T. (Nokia - FI/Espoo)" <tommi.t.rantala@nokia.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 31784cff7e ]
In preparation for converting exec_update_mutex to a rwsem so that
multiple readers can execute in parallel and not deadlock, add
down_read_interruptible. This is needed for perf_event_open to be
converted (with no semantic changes) from working on a mutex to
wroking on a rwsem.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/87k0tybqfy.fsf@x220.int.ebiederm.org
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 0f9368b5bf ]
In preparation for converting exec_update_mutex to a rwsem so that
multiple readers can execute in parallel and not deadlock, add
down_read_killable_nested. This is needed so that kcmp_lock
can be converted from working on a mutexes to working on rw_semaphores.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/87o8jabqh3.fsf@x220.int.ebiederm.org
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit 5812b32e01 upstream.
Specify type alignment when declaring linker-section match-table entries
to prevent gcc from increasing alignment and corrupting the various
tables with padding (e.g. timers, irqchips, clocks, reserved memory).
This is specifically needed on x86 where gcc (typically) aligns larger
objects like struct of_device_id with static extent on 32-byte
boundaries which at best prevents matching on anything but the first
entry. Specifying alignment when declaring variables suppresses this
optimisation.
Here's a 64-bit example where all entries are corrupt as 16 bytes of
padding has been inserted before the first entry:
ffffffff8266b4b0 D __clk_of_table
ffffffff8266b4c0 d __of_table_fixed_factor_clk
ffffffff8266b5a0 d __of_table_fixed_clk
ffffffff8266b680 d __clk_of_table_sentinel
And here's a 32-bit example where the 8-byte-aligned table happens to be
placed on a 32-byte boundary so that all but the first entry are corrupt
due to the 28 bytes of padding inserted between entries:
812b3ec0 D __irqchip_of_table
812b3ec0 d __of_table_irqchip1
812b3fa0 d __of_table_irqchip2
812b4080 d __of_table_irqchip3
812b4160 d irqchip_of_match_end
Verified on x86 using gcc-9.3 and gcc-4.9 (which uses 64-byte
alignment), and on arm using gcc-7.2.
Note that there are no in-tree users of these tables on x86 currently
(even if they are included in the image).
Fixes: 54196ccbe0 ("of: consolidate linker section OF match table declarations")
Fixes: f6e916b820 ("irqchip: add basic infrastructure")
Cc: stable <stable@vger.kernel.org> # 3.9
Signed-off-by: Johan Hovold <johan@kernel.org>
Link: https://lore.kernel.org/r/20201123102319.8090-2-johan@kernel.org
[ johan: adjust context to 5.4 ]
Signed-off-by: Johan Hovold <johan@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit a85cbe6159 upstream.
and include <linux/const.h> in UAPI headers instead of <linux/kernel.h>.
The reason is to avoid indirect <linux/sysinfo.h> include when using
some network headers: <linux/netlink.h> or others -> <linux/kernel.h>
-> <linux/sysinfo.h>.
This indirect include causes on MUSL redefinition of struct sysinfo when
included both <sys/sysinfo.h> and some of UAPI headers:
In file included from x86_64-buildroot-linux-musl/sysroot/usr/include/linux/kernel.h:5,
from x86_64-buildroot-linux-musl/sysroot/usr/include/linux/netlink.h:5,
from ../include/tst_netlink.h:14,
from tst_crypto.c:13:
x86_64-buildroot-linux-musl/sysroot/usr/include/linux/sysinfo.h:8:8: error: redefinition of `struct sysinfo'
struct sysinfo {
^~~~~~~
In file included from ../include/tst_safe_macros.h:15,
from ../include/tst_test.h:93,
from tst_crypto.c:11:
x86_64-buildroot-linux-musl/sysroot/usr/include/sys/sysinfo.h:10:8: note: originally defined here
Link: https://lkml.kernel.org/r/20201015190013.8901-1-petr.vorel@gmail.com
Signed-off-by: Petr Vorel <petr.vorel@gmail.com>
Suggested-by: Rich Felker <dalias@aerifal.cx>
Acked-by: Rich Felker <dalias@libc.org>
Cc: Peter Korsgaard <peter@korsgaard.com>
Cc: Baruch Siach <baruch@tkos.co.il>
Cc: Florian Weimer <fweimer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 3ceb6543e9 upstream.
There isn't really any valid reason to use __FSCRYPT_MODE_MAX or
FSCRYPT_POLICY_FLAGS_VALID in a userspace program. These constants are
only meant to be used by the kernel internally, and they are defined in
the UAPI header next to the mode numbers and flags only so that kernel
developers don't forget to update them when adding new modes or flags.
In https://lkml.kernel.org/r/20201005074133.1958633-2-satyat@google.com
there was an example of someone wanting to use __FSCRYPT_MODE_MAX in a
user program, and it was wrong because the program would have broken if
__FSCRYPT_MODE_MAX were ever increased. So having this definition
available is harmful. FSCRYPT_POLICY_FLAGS_VALID has the same problem.
So, remove these definitions from the UAPI header. Replace
FSCRYPT_POLICY_FLAGS_VALID with just listing the valid flags explicitly
in the one kernel function that needs it. Move __FSCRYPT_MODE_MAX to
fscrypt_private.h, remove the double underscores (which were only
present to discourage use by userspace), and add a BUILD_BUG_ON() and
comments to (hopefully) ensure it is kept in sync.
Keep the old name FS_POLICY_FLAGS_VALID, since it's been around for
longer and there's a greater chance that removing it would break source
compatibility with some program. Indeed, mtd-utils is using it in
an #ifdef, and removing it would introduce compiler warnings (about
FS_POLICY_FLAGS_PAD_* being redefined) into the mtd-utils build.
However, reduce its value to 0x07 so that it only includes the flags
with old names (the ones present before Linux 5.4), and try to make it
clear that it's now "frozen" and no new flags should be added to it.
Fixes: 2336d0deb2 ("fscrypt: use FSCRYPT_ prefix for uapi constants")
Cc: <stable@vger.kernel.org> # v5.4+
Link: https://lore.kernel.org/r/20201024005132.495952-1-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 159e1de201 upstream.
It's possible to create a duplicate filename in an encrypted directory
by creating a file concurrently with adding the encryption key.
Specifically, sys_open(O_CREAT) (or sys_mkdir(), sys_mknod(), or
sys_symlink()) can lookup the target filename while the directory's
encryption key hasn't been added yet, resulting in a negative no-key
dentry. The VFS then calls ->create() (or ->mkdir(), ->mknod(), or
->symlink()) because the dentry is negative. Normally, ->create() would
return -ENOKEY due to the directory's key being unavailable. However,
if the key was added between the dentry lookup and ->create(), then the
filesystem will go ahead and try to create the file.
If the target filename happens to already exist as a normal name (not a
no-key name), a duplicate filename may be added to the directory.
In order to fix this, we need to fix the filesystems to prevent
->create(), ->mkdir(), ->mknod(), and ->symlink() on no-key names.
(->rename() and ->link() need it too, but those are already handled
correctly by fscrypt_prepare_rename() and fscrypt_prepare_link().)
In preparation for this, add a helper function fscrypt_is_nokey_name()
that filesystems can use to do this check. Use this helper function for
the existing checks that fs/crypto/ does for rename and link.
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20201118075609.120337-2-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 3dc86ca6b4 upstream.
This commit adds a counter of pending messages for each watch in the
struct. It is used to skip unnecessary pending messages lookup in
'unregister_xenbus_watch()'. It could also be used in 'will_handle'
callback.
This is part of XSA-349
Cc: stable@vger.kernel.org
Signed-off-by: SeongJae Park <sjpark@amazon.de>
Reported-by: Michael Kurth <mku@amazon.de>
Reported-by: Pawel Wieczorkiewicz <wipawel@amazon.de>
Reviewed-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 2e85d32b1c upstream.
Some code does not directly make 'xenbus_watch' object and call
'register_xenbus_watch()' but use 'xenbus_watch_path()' instead. This
commit adds support of 'will_handle' callback in the
'xenbus_watch_path()' and it's wrapper, 'xenbus_watch_pathfmt()'.
This is part of XSA-349
Cc: stable@vger.kernel.org
Signed-off-by: SeongJae Park <sjpark@amazon.de>
Reported-by: Michael Kurth <mku@amazon.de>
Reported-by: Pawel Wieczorkiewicz <wipawel@amazon.de>
Reviewed-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit fed1755b11 upstream.
If handling logics of watch events are slower than the events enqueue
logic and the events can be created from the guests, the guests could
trigger memory pressure by intensively inducing the events, because it
will create a huge number of pending events that exhausting the memory.
Fortunately, some watch events could be ignored, depending on its
handler callback. For example, if the callback has interest in only one
single path, the watch wouldn't want multiple pending events. Or, some
watches could ignore events to same path.
To let such watches to volutarily help avoiding the memory pressure
situation, this commit introduces new watch callback, 'will_handle'. If
it is not NULL, it will be called for each new event just before
enqueuing it. Then, if the callback returns false, the event will be
discarded. No watch is using the callback for now, though.
This is part of XSA-349
Cc: stable@vger.kernel.org
Signed-off-by: SeongJae Park <sjpark@amazon.de>
Reported-by: Michael Kurth <mku@amazon.de>
Reported-by: Pawel Wieczorkiewicz <wipawel@amazon.de>
Reviewed-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 7482c5cb90 upstream.
The idea behind acpi_pm_set_bridge_wakeup() was to allow bridges to
be reference counted for wakeup enabling, because they may be enabled
to signal wakeup on behalf of their subordinate devices and that
may happen for multiple times in a row, whereas for the other devices
it only makes sense to enable wakeup signaling once.
However, this becomes problematic if the bridge itself is suspended,
because it is treated as a "regular" device in that case and the
reference counting doesn't work.
For instance, suppose that there are two devices below a bridge and
they both can signal wakeup. Every time one of them is suspended,
wakeup signaling is enabled for the bridge, so when they both have
been suspended, the bridge's wakeup reference counter value is 2.
Say that the bridge is suspended subsequently and acpi_pci_wakeup()
is called for it. Because the bridge can signal wakeup, that
function will invoke acpi_pm_set_device_wakeup() to configure it
and __acpi_pm_set_device_wakeup() will be called with the last
argument equal to 1. This causes __acpi_device_wakeup_enable()
invoked by it to omit the reference counting, because the reference
counter of the target device (the bridge) is 2 at that time.
Now say that the bridge resumes and one of the device below it
resumes too, so the bridge's reference counter becomes 0 and
wakeup signaling is disabled for it, but there is still the other
suspended device which may need the bridge to signal wakeup on its
behalf and that is not going to work.
To address this scenario, use wakeup enable reference counting for
all devices, not just for bridges, so drop the last argument from
__acpi_device_wakeup_enable() and __acpi_pm_set_device_wakeup(),
which causes acpi_pm_set_device_wakeup() and
acpi_pm_set_bridge_wakeup() to become identical, so drop the latter
and use the former instead of it everywhere.
Fixes: 1ba51a7c14 ("ACPI / PCI / PM: Rework acpi_pci_propagate_wakeup()")
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Mika Westerberg <mika.westerberg@linux.intel.com>
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
Cc: 4.14+ <stable@vger.kernel.org> # 4.14+
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit ed9b25d197 ]
Namespaced file capabilities were introduced in 8db6c34f1d .
When userspace reads an xattr for a namespaced capability, a
virtualized representation of it is returned if the caller is
in a user namespace owned by the capability's owning rootid.
The function which performs this virtualization was not hooked
up if CONFIG_SECURITY=n. Therefore in that case the original
xattr was shown instead of the virtualized one.
To test this using libcap-bin (*1),
$ v=$(mktemp)
$ unshare -Ur setcap cap_sys_admin-eip $v
$ unshare -Ur setcap -v cap_sys_admin-eip $v
/tmp/tmp.lSiIFRvt8Y: OK
"setcap -v" verifies the values instead of setting them, and
will check whether the rootid value is set. Therefore, with
this bug un-fixed, and with CONFIG_SECURITY=n, setcap -v will
fail:
$ v=$(mktemp)
$ unshare -Ur setcap cap_sys_admin=eip $v
$ unshare -Ur setcap -v cap_sys_admin=eip $v
nsowner[got=1000, want=0],/tmp/tmp.HHDiOOl9fY differs in []
Fix this bug by calling cap_inode_getsecurity() in
security_inode_getsecurity() instead of returning
-EOPNOTSUPP, when CONFIG_SECURITY=n.
*1 - note, if libcap is too old for getcap to have the '-n'
option, then use verify-caps instead.
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=209689
Cc: Hervé Guillemet <herve@guillemet.org>
Acked-by: Casey Schaufler <casey@schaufler-ca.com>
Signed-off-by: Serge Hallyn <shallyn@cisco.com>
Signed-off-by: Andrew G. Morgan <morgan@kernel.org>
Signed-off-by: James Morris <jamorris@linux.microsoft.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit d9a9280a0d ]
Building with W=2 prints a number of warnings for one function that
has a pointer type mismatch:
linux/seq_buf.h: In function 'seq_buf_init':
linux/seq_buf.h:35:12: warning: pointer targets in assignment from 'unsigned char *' to 'char *' differ in signedness [-Wpointer-sign]
Change the type in the function prototype according to the type in
the structure.
Link: https://lkml.kernel.org/r/20201026161108.3707783-1-arnd@kernel.org
Fixes: 9a7777935c ("tracing: Convert seq_buf fields to be like seq_file fields")
Reviewed-by: Cezary Rojewski <cezary.rojewski@intel.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit d5aa6b22e2 ]
According to RFC5666, the correct netid for an IPv6 addressed RDMA
transport is "rdma6", which we've supported as a mount option since
Linux-4.7. The problem is when we try to load the module "xprtrdma6",
that will fail, since there is no modulealias of that name.
Fixes: 181342c5eb ("xprtrdma: Add rdma6 option to support NFS/RDMA IPv6")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit 92eb6c3060 upstream.
Commit 3f69cc6076 ("crypto: af_alg - Allow arbitrarily long algorithm
names") made the kernel start accepting arbitrarily long algorithm names
in sockaddr_alg. However, the actual length of the salg_name field
stayed at the original 64 bytes.
This is broken because the kernel can access indices >= 64 in salg_name,
which is undefined behavior -- even though the memory that is accessed
is still located within the sockaddr structure. It would only be
defined behavior if the array were properly marked as arbitrary-length
(either by making it a flexible array, which is the recommended way
these days, or by making it an array of length 0 or 1).
We can't simply change salg_name into a flexible array, since that would
break source compatibility with userspace programs that embed
sockaddr_alg into another struct, or (more commonly) declare a
sockaddr_alg like 'struct sockaddr_alg sa = { .salg_name = "foo" };'.
One solution would be to change salg_name into a flexible array only
when '#ifdef __KERNEL__'. However, that would keep userspace without an
easy way to actually use the longer algorithm names.
Instead, add a new structure 'sockaddr_alg_new' that has the flexible
array field, and expose it to both userspace and the kernel.
Make the kernel use it correctly in alg_bind().
This addresses the syzbot report
"UBSAN: array-index-out-of-bounds in alg_bind"
(https://syzkaller.appspot.com/bug?extid=92ead4eb8e26a26d465e).
Reported-by: syzbot+92ead4eb8e26a26d465e@syzkaller.appspotmail.com
Fixes: 3f69cc6076 ("crypto: af_alg - Allow arbitrarily long algorithm names")
Cc: <stable@vger.kernel.org> # v4.12+
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 1fa5cef283 ]
refcount of rx_buffer page will be added here originally, so prefetchw
is needed, but after commit 1793668c3b ("i40e/i40evf: Update code to
better handle incrementing page count"), and refcount is not added
every time, so change prefetchw as prefetch.
Now it mainly services page_address(), but which accesses struct page
only when WANT_PAGE_VIRTUAL or HASHED_PAGE_VIRTUAL is defined otherwise
it returns address based on offset, so we prefetch it conditionally.
Jakub suggested to define prefetch_page_address in a common header.
Reported-by: kernel test robot <lkp@intel.com>
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 917d80d376 ]
Use nf_msecs_to_jiffies64 and nf_jiffies64_to_msecs as provided by
8e1102d5a1 ("netfilter: nf_tables: support timeouts larger than 23
days"), otherwise ruleset listing breaks.
Fixes: a8b1e36d0d ("netfilter: nft_dynset: fix element timeout for HZ != 1000")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit ffe8923f10 ]
Pablo Neira found that after recent update of xt_IDLETIMER the
iptables-nft tests sometimes show an error.
He tracked this down to the delayed cleanup used by nf_tables core:
del rule (transaction A)
add rule (transaction B)
Its possible that by time transaction B (both in same netns) runs,
the xt target destructor has not been invoked yet.
For native nft expressions this is no problem because all expressions
that have such side effects make sure these are handled from the commit
phase, rather than async cleanup.
For nft_compat however this isn't true.
Instead of forcing synchronous behaviour for nft_compat, keep track
of the number of outstanding destructor calls.
When we attempt to create a new expression, flush the cleanup worker
to make sure destructors have completed.
With lots of help from Pablo Neira.
Reported-by: Pablo Neira Ayso <pablo@netfilter.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit cc00bcaa58 ]
When running concurrent iptables rules replacement with data, the per CPU
sequence count is checked after the assignment of the new information.
The sequence count is used to synchronize with the packet path without the
use of any explicit locking. If there are any packets in the packet path using
the table information, the sequence count is incremented to an odd value and
is incremented to an even after the packet process completion.
The new table value assignment is followed by a write memory barrier so every
CPU should see the latest value. If the packet path has started with the old
table information, the sequence counter will be odd and the iptables
replacement will wait till the sequence count is even prior to freeing the
old table info.
However, this assumes that the new table information assignment and the memory
barrier is actually executed prior to the counter check in the replacement
thread. If CPU decides to execute the assignment later as there is no user of
the table information prior to the sequence check, the packet path in another
CPU may use the old table information. The replacement thread would then free
the table information under it leading to a use after free in the packet
processing context-
Unable to handle kernel NULL pointer dereference at virtual
address 000000000000008e
pc : ip6t_do_table+0x5d0/0x89c
lr : ip6t_do_table+0x5b8/0x89c
ip6t_do_table+0x5d0/0x89c
ip6table_filter_hook+0x24/0x30
nf_hook_slow+0x84/0x120
ip6_input+0x74/0xe0
ip6_rcv_finish+0x7c/0x128
ipv6_rcv+0xac/0xe4
__netif_receive_skb+0x84/0x17c
process_backlog+0x15c/0x1b8
napi_poll+0x88/0x284
net_rx_action+0xbc/0x23c
__do_softirq+0x20c/0x48c
This could be fixed by forcing instruction order after the new table
information assignment or by switching to RCU for the synchronization.
Fixes: 80055dab5d ("netfilter: x_tables: make xt_replace_table wait until old rules are not used anymore")
Reported-by: Sean Tranchetti <stranche@codeaurora.org>
Reported-by: kernel test robot <lkp@intel.com>
Suggested-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit 3347acc6fc upstream.
Commit 815f0ddb34 ("include/linux/compiler*.h: make compiler-*.h
mutually exclusive") neglected to copy barrier_data() from
compiler-gcc.h into compiler-clang.h.
The definition in compiler-gcc.h was really to work around clang's more
aggressive optimization, so this broke barrier_data() on clang, and
consequently memzero_explicit() as well.
For example, this results in at least the memzero_explicit() call in
lib/crypto/sha256.c:sha256_transform() being optimized away by clang.
Fix this by moving the definition of barrier_data() into compiler.h.
Also move the gcc/clang definition of barrier() into compiler.h,
__memory_barrier() is icc-specific (and barrier() is already defined
using it in compiler-intel.h) and doesn't belong in compiler.h.
[rdunlap@infradead.org: fix ALPHA builds when SMP is not enabled]
Link: https://lkml.kernel.org/r/20201101231835.4589-1-rdunlap@infradead.org
Fixes: 815f0ddb34 ("include/linux/compiler*.h: make compiler-*.h mutually exclusive")
Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu>
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Nick Desaulniers <ndesaulniers@google.com>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/20201014212631.207844-1-nivedita@alum.mit.edu
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[nd: backport to account for missing
commit e506ea4512 ("compiler.h: Split {READ,WRITE}_ONCE definitions out into rwonce.h")
commit d08b9f0ca6 ("scs: Add support for Clang's Shadow Call Stack (SCS)")]
Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit e91d8d7823 upstream.
While I was doing zram testing, I found sometimes decompression failed
since the compression buffer was corrupted. With investigation, I found
below commit calls cond_resched unconditionally so it could make a
problem in atomic context if the task is reschedule.
BUG: sleeping function called from invalid context at mm/vmalloc.c:108
in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 946, name: memhog
3 locks held by memhog/946:
#0: ffff9d01d4b193e8 (&mm->mmap_lock#2){++++}-{4:4}, at: __mm_populate+0x103/0x160
#1: ffffffffa3d53de0 (fs_reclaim){+.+.}-{0:0}, at: __alloc_pages_slowpath.constprop.0+0xa98/0x1160
#2: ffff9d01d56b8110 (&zspage->lock){.+.+}-{3:3}, at: zs_map_object+0x8e/0x1f0
CPU: 0 PID: 946 Comm: memhog Not tainted 5.9.3-00011-gc5bfc0287345-dirty #316
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1 04/01/2014
Call Trace:
unmap_kernel_range_noflush+0x2eb/0x350
unmap_kernel_range+0x14/0x30
zs_unmap_object+0xd5/0xe0
zram_bvec_rw.isra.0+0x38c/0x8e0
zram_rw_page+0x90/0x101
bdev_write_page+0x92/0xe0
__swap_writepage+0x94/0x4a0
pageout+0xe3/0x3a0
shrink_page_list+0xb94/0xd60
shrink_inactive_list+0x158/0x460
We can fix this by removing the ZSMALLOC_PGTABLE_MAPPING feature (which
contains the offending calling code) from zsmalloc.
Even though this option showed some amount improvement(e.g., 30%) in
some arm32 platforms, it has been headache to maintain since it have
abused APIs[1](e.g., unmap_kernel_range in atomic context).
Since we are approaching to deprecate 32bit machines and already made
the config option available for only builtin build since v5.8, lastly it
has been not default option in zsmalloc, it's time to drop the option
for better maintenance.
[1] http://lore.kernel.org/linux-mm/20201105170249.387069-1-minchan@kernel.org
Fixes: e47110e905 ("mm/vunmap: add cond_resched() in vunmap_pmd_range")
Signed-off-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Tony Lindgren <tony@atomide.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Harish Sriram <harish@linux.ibm.com>
Cc: Uladzislau Rezki <urezki@gmail.com>
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/20201117202916.GA3856507@google.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 3c78e9e0d3 upstream.
This patch adds nft_flow_rule_set_addr_type() to set the address type
from the nft_payload expression accordingly.
If the address type is not set in the control dissector then a rule that
matches either on source or destination IP address does not work.
After this patch, nft hardware offload generates the flow dissector
configuration as tc-flower does to match on an IP address.
This patch has been also tested functionally to make sure packets are
filtered out by the NIC.
This is also getting the code aligned with the existing netfilter flow
offload infrastructure which is also setting the control dissector.
Fixes: c9626a2cbd ("netfilter: nf_tables: add hardware offload support")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit bb4c6910c8 upstream.
There is currently no way to convey the affinity of an interrupt
via irq_create_mapping(), which creates issues for devices that
expect that affinity to be managed by the kernel.
In order to sort this out, rename irq_create_mapping() to
irq_create_mapping_affinity() with an additional affinity parameter that
can be passed down to irq_domain_alloc_descs().
irq_create_mapping() is re-implemented as a wrapper around
irq_create_mapping_affinity().
No functional change.
Fixes: e75eafb9b0 ("genirq/msi: Switch to new irq spreading infrastructure")
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Greg Kurz <groug@kaod.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20201126082852.1178497-2-lvivier@redhat.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit c8bcd9c5be upstream.
Currently, locking of ->session is very inconsistent; most places
protect it using the legacy tty mutex, but disassociate_ctty(),
__do_SAK(), tiocspgrp() and tiocgsid() don't.
Two of the writers hold the ctrl_lock (because they already need it for
->pgrp), but __proc_set_tty() doesn't do that yet.
On a PREEMPT=y system, an unprivileged user can theoretically abuse
this broken locking to read 4 bytes of freed memory via TIOCGSID if
tiocgsid() is preempted long enough at the right point. (Other things
might also go wrong, especially if root-only ioctls are involved; I'm
not sure about that.)
Change the locking on ->session such that:
- tty_lock() is held by all writers: By making disassociate_ctty()
hold it. This should be fine because the same lock can already be
taken through the call to tty_vhangup_session().
The tricky part is that we need to shorten the area covered by
siglock to be able to take tty_lock() without ugly retry logic; as
far as I can tell, this should be fine, since nothing in the
signal_struct is touched in the `if (tty)` branch.
- ctrl_lock is held by all writers: By changing __proc_set_tty() to
hold the lock a little longer.
- All readers that aren't holding tty_lock() hold ctrl_lock: By
adding locking to tiocgsid() and __do_SAK(), and expanding the area
covered by ctrl_lock in tiocspgrp().
Cc: stable@kernel.org
Signed-off-by: Jann Horn <jannh@google.com>
Reviewed-by: Jiri Slaby <jirislaby@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit d421e466c2 ]
STEs format for Connect-X5 and Connect-X6DX different. Currently, on
Connext-X6DX the SW steering would break at some point when building STEs
w/o giving a proper error message. Fix this by checking the STE format of
the current device when initializing domain: add mlx5_ifc definitions for
Connect-X6DX SW steering, read FW capability to get the current format
version, and check this version when domain is being created.
Fixes: 26d688e33f ("net/mlx5: DR, Add Steering entry (STE) utilities")
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 2867e1eac6 ]
When adding support for propagating ECT(1) marking in IP headers it seems I
suffered from endianness-confusion in the checksum update calculation: In
fact the ECN field is in the *lower* bits of the first 16-bit word of the
IP header when calculating in network byte order. This means that the
addition performed to update the checksum field was wrong; let's fix that.
Fixes: b723748750 ("tunnel: Propagate ECT(1) when decapsulating as recommended by RFC6040")
Reported-by: Jonathan Morton <chromatix99@gmail.com>
Tested-by: Pete Heist <pete@heistp.net>
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/r/20201130183705.17540-1-toke@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 025cc2fb6a ]
tls_device_offload_cleanup_rx doesn't clear tls_ctx->netdev after
calling tls_dev_del if TLX TX offload is also enabled. Clearing
tls_ctx->netdev gets postponed until tls_device_gc_task. It leaves a
time frame when tls_device_down may get called and call tls_dev_del for
RX one extra time, confusing the driver, which may lead to a crash.
This patch corrects this racy behavior by adding a flag to prevent
tls_device_down from calling tls_dev_del the second time.
Fixes: e8f6979981 ("net/tls: Add generic NIC offload infrastructure")
Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Link: https://lore.kernel.org/r/20201125221810.69870-1-saeedm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit cef3970381 ]
Stefan Agner reported a bug when using zsram on 32-bit Arm machines
with RAM above the 4GB address boundary:
Unable to handle kernel NULL pointer dereference at virtual address 00000000
pgd = a27bd01c
[00000000] *pgd=236a0003, *pmd=1ffa64003
Internal error: Oops: 207 [#1] SMP ARM
Modules linked in: mdio_bcm_unimac(+) brcmfmac cfg80211 brcmutil raspberrypi_hwmon hci_uart crc32_arm_ce bcm2711_thermal phy_generic genet
CPU: 0 PID: 123 Comm: mkfs.ext4 Not tainted 5.9.6 #1
Hardware name: BCM2711
PC is at zs_map_object+0x94/0x338
LR is at zram_bvec_rw.constprop.0+0x330/0xa64
pc : [<c0602b38>] lr : [<c0bda6a0>] psr: 60000013
sp : e376bbe0 ip : 00000000 fp : c1e2921c
r10: 00000002 r9 : c1dda730 r8 : 00000000
r7 : e8ff7a00 r6 : 00000000 r5 : 02f9ffa0 r4 : e3710000
r3 : 000fdffe r2 : c1e0ce80 r1 : ebf979a0 r0 : 00000000
Flags: nZCv IRQs on FIQs on Mode SVC_32 ISA ARM Segment user
Control: 30c5383d Table: 235c2a80 DAC: fffffffd
Process mkfs.ext4 (pid: 123, stack limit = 0x495a22e6)
Stack: (0xe376bbe0 to 0xe376c000)
As it turns out, zsram needs to know the maximum memory size, which
is defined in MAX_PHYSMEM_BITS when CONFIG_SPARSEMEM is set, or in
MAX_POSSIBLE_PHYSMEM_BITS on the x86 architecture.
The same problem will be hit on all 32-bit architectures that have a
physical address space larger than 4GB and happen to not enable sparsemem
and include asm/sparsemem.h from asm/pgtable.h.
After the initial discussion, I suggested just always defining
MAX_POSSIBLE_PHYSMEM_BITS whenever CONFIG_PHYS_ADDR_T_64BIT is
set, or provoking a build error otherwise. This addresses all
configurations that can currently have this runtime bug, but
leaves all other configurations unchanged.
I looked up the possible number of bits in source code and
datasheets, here is what I found:
- on ARC, CONFIG_ARC_HAS_PAE40 controls whether 32 or 40 bits are used
- on ARM, CONFIG_LPAE enables 40 bit addressing, without it we never
support more than 32 bits, even though supersections in theory allow
up to 40 bits as well.
- on MIPS, some MIPS32r1 or later chips support 36 bits, and MIPS32r5
XPA supports up to 60 bits in theory, but 40 bits are more than
anyone will ever ship
- On PowerPC, there are three different implementations of 36 bit
addressing, but 32-bit is used without CONFIG_PTE_64BIT
- On RISC-V, the normal page table format can support 34 bit
addressing. There is no highmem support on RISC-V, so anything
above 2GB is unused, but it might be useful to eventually support
CONFIG_ZRAM for high pages.
Fixes: 61989a80fb ("staging: zsmalloc: zsmalloc memory allocation library")
Fixes: 02390b87a9 ("mm/zsmalloc: Prepare to variable MAX_PHYSMEM_BITS")
Acked-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Reviewed-by: Stefan Agner <stefan@agner.ch>
Tested-by: Stefan Agner <stefan@agner.ch>
Acked-by: Mike Rapoport <rppt@linux.ibm.com>
Link: https://lore.kernel.org/linux-mm/bdfa44bf1c570b05d6c70898e2bbb0acf234ecdf.1604762181.git.stefan@agner.ch/
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Sasha Levin <sashal@kernel.org>