commit f07afa0462 upstream.
Even if we don't have extended SCA support, we can have more than 64 CPUs
if we don't enable any HW features that might use the SCA entries.
Now, this works just fine, but we missed a return, which is why we
would actually store the SCA entries. If we have more than 64 CPUs, this
means writing outside of the basic SCA - bad.
Let's fix this. This allows > 64 CPUs when running nested (under vSIE)
without random crashes.
Fixes: a6940674c3 ("KVM: s390: allow 255 VCPUs when sca entries aren't used")
Reported-by: Christian Borntraeger <borntraeger@de.ibm.com>
Tested-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
Message-Id: <20180306132758.21034-1-david@redhat.com>
Cc: stable@vger.kernel.org
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 1d037577c3 upstream.
The following commit:
commit aa4d86163e ("block: loop: switch to VFS ITER_BVEC")
replaced __do_lo_send_write(), which used ITER_KVEC iterators, with
lo_write_bvec() which uses ITER_BVEC iterators. In this change, though,
the WRITE flag was lost:
- iov_iter_kvec(&from, ITER_KVEC | WRITE, &kvec, 1, len);
+ iov_iter_bvec(&i, ITER_BVEC, bvec, 1, bvec->bv_len);
This flag is necessary for the DAX case because we make decisions based on
whether or not the iterator is a READ or a WRITE in dax_iomap_actor() and
in dax_iomap_rw().
We end up going through this path in configurations where we combine a PMEM
device with 4k sectors, a loopback device and DAX. The consequence of this
missed flag is that what we intend as a write actually turns into a read in
the DAX code, so no data is ever written.
The very simplest test case is to create a loopback device and try and
write a small string to it, then hexdump a few bytes of the device to see
if the write took. Without this patch you read back all zeros, with this
you read back the string you wrote.
For XFS this causes us to fail or panic during the following xfstests:
xfs/074 xfs/078 xfs/216 xfs/217 xfs/250
For ext4 we have a similar issue where writes never happen, but we don't
currently have any xfstests that use loopback and show this issue.
Fix this by restoring the WRITE flag argument to iov_iter_bvec(). This
causes the xfstests to all pass.
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: stable@vger.kernel.org
Fixes: commit aa4d86163e ("block: loop: switch to VFS ITER_BVEC")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit ea4f7bd2ac upstream.
If matrix_keypad_stop() is executing and the keypad interrupt is triggered,
disable_row_irqs() may be called by both matrix_keypad_interrupt() and
matrix_keypad_stop() at the same time, causing interrupts to be disabled
twice and the keypad being "stuck" after resuming.
Take lock when setting keypad->stopped to ensure that ISR will not race
with matrix_keypad_stop() disabling interrupts.
Signed-off-by: Zhang Bo <zbsdta@126.com>
Cc: stable@vger.kernel.org
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 06a3f0c9f2 upstream.
Commit a3e6c1eff5 ("MIPS: IRQ: Fix disable_irq on CPU IRQs") fixes an
issue where disable_irq did not actually disable the irq. The bug caused
our IPIs to not be disabled, which actually is the correct behavior.
With the addition of commit a3e6c1eff5 ("MIPS: IRQ: Fix disable_irq on
CPU IRQs"), the IPIs were getting disabled going into suspend, thus
schedule_ipi() was not being called. This caused deadlocks where
schedulable task were not being scheduled and other cpus were waiting
for them to do something.
Add the IRQF_NO_SUSPEND flag so an irq_disable will not be called on the
IPIs during suspend.
Signed-off-by: Justin Chen <justinpopo6@gmail.com>
Fixes: a3e6c1eff5 ("MIPS: IRQ: Fix disabled_irq on CPU IRQs")
Cc: Florian Fainelli <f.fainelli@gmail.com>
Cc: linux-mips@linux-mips.org
Cc: stable@vger.kernel.org
Patchwork: https://patchwork.linux-mips.org/patch/17385/
[jhogan@kernel.org: checkpatch: wrap long lines and fix commit refs]
Signed-off-by: James Hogan <jhogan@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 15734feff2 upstream.
radeon's ->runtime_suspend hook calls drm_kms_helper_poll_disable(),
which waits for the output poll worker to finish if it's running.
The output poll worker meanwhile calls pm_runtime_get_sync() in
radeon's ->detect hooks, which waits for the ongoing suspend to finish,
causing a deadlock.
Fix by not acquiring a runtime PM ref if the ->detect hooks are called
in the output poll worker's context. This is safe because the poll
worker is only enabled while runtime active and we know that
->runtime_suspend waits for it to finish.
Stack trace for posterity:
INFO: task kworker/0:3:31847 blocked for more than 120 seconds
Workqueue: events output_poll_execute [drm_kms_helper]
Call Trace:
schedule+0x3c/0x90
rpm_resume+0x1e2/0x690
__pm_runtime_resume+0x3f/0x60
radeon_lvds_detect+0x39/0xf0 [radeon]
output_poll_execute+0xda/0x1e0 [drm_kms_helper]
process_one_work+0x14b/0x440
worker_thread+0x48/0x4a0
INFO: task kworker/2:0:10493 blocked for more than 120 seconds.
Workqueue: pm pm_runtime_work
Call Trace:
schedule+0x3c/0x90
schedule_timeout+0x1b3/0x240
wait_for_common+0xc2/0x180
wait_for_completion+0x1d/0x20
flush_work+0xfc/0x1a0
__cancel_work_timer+0xa5/0x1d0
cancel_delayed_work_sync+0x13/0x20
drm_kms_helper_poll_disable+0x1f/0x30 [drm_kms_helper]
radeon_pmops_runtime_suspend+0x3d/0xa0 [radeon]
pci_pm_runtime_suspend+0x61/0x1a0
vga_switcheroo_runtime_suspend+0x21/0x70
__rpm_callback+0x32/0x70
rpm_callback+0x24/0x80
rpm_suspend+0x12b/0x640
pm_runtime_work+0x6f/0xb0
process_one_work+0x14b/0x440
worker_thread+0x48/0x4a0
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=94147
Fixes: 10ebc0bc09 ("drm/radeon: add runtime PM support (v2)")
Cc: stable@vger.kernel.org # v3.13+: 27d4ee0307: workqueue: Allow retrieval of current task's work struct
Cc: stable@vger.kernel.org # v3.13+: 25c058ccaf: drm: Allow determining if current task is output poll worker
Cc: Ismo Toijala <ismo.toijala@gmail.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Dave Airlie <airlied@redhat.com>
Reviewed-by: Lyude Paul <lyude@redhat.com>
Signed-off-by: Lukas Wunner <lukas@wunner.de>
Link: https://patchwork.freedesktop.org/patch/msgid/64ea02c44f91dda19bc563902b97bbc699040392.1518338789.git.lukas@wunner.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit d61a5c1063 upstream.
nouveau's ->runtime_suspend hook calls drm_kms_helper_poll_disable(),
which waits for the output poll worker to finish if it's running.
The output poll worker meanwhile calls pm_runtime_get_sync() in
nouveau_connector_detect() which waits for the ongoing suspend to finish,
causing a deadlock.
Fix by not acquiring a runtime PM ref if nouveau_connector_detect() is
called in the output poll worker's context. This is safe because
the poll worker is only enabled while runtime active and we know that
->runtime_suspend waits for it to finish.
Other contexts calling nouveau_connector_detect() do require a runtime
PM ref, these comprise:
status_store() drm sysfs interface
->fill_modes drm callback
drm_fb_helper_probe_connector_modes()
drm_mode_getconnector()
nouveau_connector_hotplug()
nouveau_display_hpd_work()
nv17_tv_set_property()
Stack trace for posterity:
INFO: task kworker/0:1:58 blocked for more than 120 seconds.
Workqueue: events output_poll_execute [drm_kms_helper]
Call Trace:
schedule+0x28/0x80
rpm_resume+0x107/0x6e0
__pm_runtime_resume+0x47/0x70
nouveau_connector_detect+0x7e/0x4a0 [nouveau]
nouveau_connector_detect_lvds+0x132/0x180 [nouveau]
drm_helper_probe_detect_ctx+0x85/0xd0 [drm_kms_helper]
output_poll_execute+0x11e/0x1c0 [drm_kms_helper]
process_one_work+0x184/0x380
worker_thread+0x2e/0x390
INFO: task kworker/0:2:252 blocked for more than 120 seconds.
Workqueue: pm pm_runtime_work
Call Trace:
schedule+0x28/0x80
schedule_timeout+0x1e3/0x370
wait_for_completion+0x123/0x190
flush_work+0x142/0x1c0
nouveau_pmops_runtime_suspend+0x7e/0xd0 [nouveau]
pci_pm_runtime_suspend+0x5c/0x180
vga_switcheroo_runtime_suspend+0x1e/0xa0
__rpm_callback+0xc1/0x200
rpm_callback+0x1f/0x70
rpm_suspend+0x13c/0x640
pm_runtime_work+0x6e/0x90
process_one_work+0x184/0x380
worker_thread+0x2e/0x390
Bugzilla: https://bugs.archlinux.org/task/53497
Bugzilla: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=870523
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=70388#c33
Fixes: 5addcf0a5f ("nouveau: add runtime PM support (v0.9)")
Cc: stable@vger.kernel.org # v3.12+: 27d4ee0307: workqueue: Allow retrieval of current task's work struct
Cc: stable@vger.kernel.org # v3.12+: 25c058ccaf: drm: Allow determining if current task is output poll worker
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Dave Airlie <airlied@redhat.com>
Reviewed-by: Lyude Paul <lyude@redhat.com>
Signed-off-by: Lukas Wunner <lukas@wunner.de>
Link: https://patchwork.freedesktop.org/patch/msgid/b7d2cbb609a80f59ccabfdf479b9d5907c603ea1.1518338789.git.lukas@wunner.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 25c058ccaf upstream.
Introduce a helper to determine if the current task is an output poll
worker.
This allows us to fix a long-standing deadlock in several DRM drivers
wherein the ->runtime_suspend callback waits for the output poll worker
to finish and the worker in turn calls a ->detect callback which waits
for runtime suspend to finish. The ->detect callback is invoked from
multiple call sites and waiting for runtime suspend to finish is the
correct thing to do except if it's executing in the context of the
worker.
v2: Expand kerneldoc to specifically mention deadlock between
output poll worker and autosuspend worker as use case. (Lyude)
Cc: Dave Airlie <airlied@redhat.com>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Lyude Paul <lyude@redhat.com>
Signed-off-by: Lukas Wunner <lukas@wunner.de>
Link: https://patchwork.freedesktop.org/patch/msgid/3549ce32e7f1467102e70d3e9cbf70c46bfe108e.1518593424.git.lukas@wunner.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 90024a5951 upstream.
The ACK/NACK implementation as found in e.g. the G965 has the falling
clock edge and the release of the data line after the ACK for the received
byte happen at the same time.
This is conformant with the I2C specification, which allows a zero hold
time, see footnote [3]: "A device must internally provide a hold time of
at least 300 ns for the SDA signal (with respect to the V IH(min) of the
SCL signal) to bridge the undefined region of the falling edge of SCL."
Some HDMI-to-VGA converters apparently fail to adhere to this requirement
and latch SDA at the falling clock edge, so instead of an ACK
sometimes a NACK is read and the slave (i.e. the EDID ROM) ends the
transfer.
The bitbanging releases the data line for the ACK only 1/4 bit time after
the falling clock edge, so a slave will see the correct value no matter
if it samples at the rising or the falling clock edge or in the center.
Fallback to bitbanging is already done for the CRT connector.
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=92685
Signed-off-by: Stefan Brüns <stefan.bruens@rwth-aachen.de>
Cc: stable@vger.kernel.org
Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Link: https://patchwork.freedesktop.org/patch/msgid/a39f080b-81a5-4c93-b3f7-7cb0a58daca3@rwthex-w2-a.rwth-ad.de
(cherry picked from commit cfb926e148)
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 6a21dfc0d0 upstream.
Users of ucma are supposed to provide size of option level,
in most paths it is supposed to be equal to u8 or u16, but
it is not the case for the IB path record, where it can be
multiple of struct ib_path_rec_data.
This patch takes simplest possible approach and prevents providing
values more than possible to allocate.
Reported-by: syzbot+a38b0e9f694c379ca7ce@syzkaller.appspotmail.com
Fixes: 7ce86409ad ("RDMA/ucma: Allow user space to set service type")
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit d7d8249665 upstream.
When changing a file's acl mask, btrfs_set_acl() will first set the
group bits of i_mode to the value of the mask, and only then set the
actual extended attribute representing the new acl.
If the second part fails (due to lack of space, for example) and the
file had no acl attribute to begin with, the system will from now on
assume that the mask permission bits are actual group permission bits,
potentially granting access to the wrong users.
Prevent this by restoring the original mode bits if __btrfs_set_acl
fails.
Signed-off-by: Ernesto A. Fernández <ernesto.mnd.fernandez@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ upstream commit d269176e76 ]
While working on 16338a9b3a ("bpf, arm64: fix out of bounds access in
tail call") I noticed that ppc64 JIT is partially affected as well. While
the bound checking is correctly performed as unsigned comparison, the
register with the index value however, is never truncated into 32 bit
space, so e.g. a index value of 0x100000000ULL with a map of 1 element
would pass with PPC_CMPLW() whereas we later on continue with the full
64 bit register value. Therefore, as we do in interpreter and other JITs
truncate the value to 32 bit initially in order to fix access.
Fixes: ce0761419f ("powerpc/bpf: Implement support for tail calls")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Tested-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ upstream commit 32fff239de ]
syszbot managed to trigger RCU detected stalls in
bpf_array_free_percpu()
It takes time to allocate a huge percpu map, but even more time to free
it.
Since we run in process context, use cond_resched() to yield cpu if
needed.
Fixes: a10423b87a ("bpf: introduce BPF_MAP_TYPE_PERCPU_ARRAY map")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ upstream commit 16338a9b3a ]
I recently noticed a crash on arm64 when feeding a bogus index
into BPF tail call helper. The crash would not occur when the
interpreter is used, but only in case of JIT. Output looks as
follows:
[ 347.007486] Unable to handle kernel paging request at virtual address fffb850e96492510
[...]
[ 347.043065] [fffb850e96492510] address between user and kernel address ranges
[ 347.050205] Internal error: Oops: 96000004 [#1] SMP
[...]
[ 347.190829] x13: 0000000000000000 x12: 0000000000000000
[ 347.196128] x11: fffc047ebe782800 x10: ffff808fd7d0fd10
[ 347.201427] x9 : 0000000000000000 x8 : 0000000000000000
[ 347.206726] x7 : 0000000000000000 x6 : 001c991738000000
[ 347.212025] x5 : 0000000000000018 x4 : 000000000000ba5a
[ 347.217325] x3 : 00000000000329c4 x2 : ffff808fd7cf0500
[ 347.222625] x1 : ffff808fd7d0fc00 x0 : ffff808fd7cf0500
[ 347.227926] Process test_verifier (pid: 4548, stack limit = 0x000000007467fa61)
[ 347.235221] Call trace:
[ 347.237656] 0xffff000002f3a4fc
[ 347.240784] bpf_test_run+0x78/0xf8
[ 347.244260] bpf_prog_test_run_skb+0x148/0x230
[ 347.248694] SyS_bpf+0x77c/0x1110
[ 347.251999] el0_svc_naked+0x30/0x34
[ 347.255564] Code: 9100075a d280220a 8b0a002a d37df04b (f86b694b)
[...]
In this case the index used in BPF r3 is the same as in r1
at the time of the call, meaning we fed a pointer as index;
here, it had the value 0xffff808fd7cf0500 which sits in x2.
While I found tail calls to be working in general (also for
hitting the error cases), I noticed the following in the code
emission:
# bpftool p d j i 988
[...]
38: ldr w10, [x1,x10]
3c: cmp w2, w10
40: b.ge 0x000000000000007c <-- signed cmp
44: mov x10, #0x20 // #32
48: cmp x26, x10
4c: b.gt 0x000000000000007c
50: add x26, x26, #0x1
54: mov x10, #0x110 // #272
58: add x10, x1, x10
5c: lsl x11, x2, #3
60: ldr x11, [x10,x11] <-- faulting insn (f86b694b)
64: cbz x11, 0x000000000000007c
[...]
Meaning, the tests passed because commit ddb55992b0 ("arm64:
bpf: implement bpf_tail_call() helper") was using signed compares
instead of unsigned which as a result had the test wrongly passing.
Change this but also the tail call count test both into unsigned
and cap the index as u32. Latter we did as well in 90caccdd8c
("bpf: fix bpf_tail_call() x64 JIT") and is needed in addition here,
too. Tested on HiSilicon Hi1616.
Result after patch:
# bpftool p d j i 268
[...]
38: ldr w10, [x1,x10]
3c: add w2, w2, #0x0
40: cmp w2, w10
44: b.cs 0x0000000000000080
48: mov x10, #0x20 // #32
4c: cmp x26, x10
50: b.hi 0x0000000000000080
54: add x26, x26, #0x1
58: mov x10, #0x110 // #272
5c: add x10, x1, x10
60: lsl x11, x2, #3
64: ldr x11, [x10,x11]
68: cbz x11, 0x0000000000000080
[...]
Fixes: ddb55992b0 ("arm64: bpf: implement bpf_tail_call() helper")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ upstream commit a493a87f38 ]
Implement a retpoline [0] for the BPF tail call JIT'ing that converts
the indirect jump via jmp %rax that is used to make the long jump into
another JITed BPF image. Since this is subject to speculative execution,
we need to control the transient instruction sequence here as well
when CONFIG_RETPOLINE is set, and direct it into a pause + lfence loop.
The latter aligns also with what gcc / clang emits (e.g. [1]).
JIT dump after patch:
# bpftool p d x i 1
0: (18) r2 = map[id:1]
2: (b7) r3 = 0
3: (85) call bpf_tail_call#12
4: (b7) r0 = 2
5: (95) exit
With CONFIG_RETPOLINE:
# bpftool p d j i 1
[...]
33: cmp %edx,0x24(%rsi)
36: jbe 0x0000000000000072 |*
38: mov 0x24(%rbp),%eax
3e: cmp $0x20,%eax
41: ja 0x0000000000000072 |
43: add $0x1,%eax
46: mov %eax,0x24(%rbp)
4c: mov 0x90(%rsi,%rdx,8),%rax
54: test %rax,%rax
57: je 0x0000000000000072 |
59: mov 0x28(%rax),%rax
5d: add $0x25,%rax
61: callq 0x000000000000006d |+
66: pause |
68: lfence |
6b: jmp 0x0000000000000066 |
6d: mov %rax,(%rsp) |
71: retq |
72: mov $0x2,%eax
[...]
* relative fall-through jumps in error case
+ retpoline for indirect jump
Without CONFIG_RETPOLINE:
# bpftool p d j i 1
[...]
33: cmp %edx,0x24(%rsi)
36: jbe 0x0000000000000063 |*
38: mov 0x24(%rbp),%eax
3e: cmp $0x20,%eax
41: ja 0x0000000000000063 |
43: add $0x1,%eax
46: mov %eax,0x24(%rbp)
4c: mov 0x90(%rsi,%rdx,8),%rax
54: test %rax,%rax
57: je 0x0000000000000063 |
59: mov 0x28(%rax),%rax
5d: add $0x25,%rax
61: jmpq *%rax |-
63: mov $0x2,%eax
[...]
* relative fall-through jumps in error case
- plain indirect jump as before
[0] https://support.google.com/faqs/answer/7625886
[1] a31e654fa1
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ upstream commit 9c2d63b843 ]
syzkaller recently triggered OOM during percpu map allocation;
while there is work in progress by Dennis Zhou to add __GFP_NORETRY
semantics for percpu allocator under pressure, there seems also a
missing bpf_map_precharge_memlock() check in array map allocation.
Given today the actual bpf_map_charge_memlock() happens after the
find_and_alloc_map() in syscall path, the bpf_map_precharge_memlock()
is there to bail out early before we go and do the map setup work
when we find that we hit the limits anyway. Therefore add this for
array map as well.
Fixes: 6c90598174 ("bpf: pre-allocate hash map elements")
Fixes: a10423b87a ("bpf: introduce BPF_MAP_TYPE_PERCPU_ARRAY map")
Reported-by: syzbot+adb03f3f0bb57ce3acda@syzkaller.appspotmail.com
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Dennis Zhou <dennisszhou@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ upstream commit a316338cb7 ]
trie_alloc() always needs to have BPF_F_NO_PREALLOC passed in via
attr->map_flags, since it does not support preallocation yet. We
check the flag, but we never copy the flag into trie->map.map_flags,
which is later on exposed into fdinfo and used by loaders such as
iproute2. Latter uses this in bpf_map_selfcheck_pinned() to test
whether a pinned map has the same spec as the one from the BPF obj
file and if not, bails out, which is currently the case for lpm
since it exposes always 0 as flags.
Also copy over flags in array_map_alloc() and stack_map_alloc().
They always have to be 0 right now, but we should make sure to not
miss to copy them over at a later point in time when we add actual
flags for them to use.
Fixes: b95a5c4db0 ("bpf: add a longest prefix match trie map implementation")
Reported-by: Jarno Rajahalme <jarno@covalent.io>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 3968523f85 upstream.
mpls_label_ok() validates that the 'platform_label' array index from a
userspace netlink message payload is valid. Under speculation the
mpls_label_ok() result may not resolve in the CPU pipeline until after
the index is used to access an array element. Sanitize the index to zero
to prevent userspace-controlled arbitrary out-of-bounds speculation, a
precursor for a speculative execution side channel vulnerability.
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
[bwh: Backported to 4.4:
- mpls_label_ok() doesn't take an extack parameter
- Drop change in mpls_getroute()]
Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 07f2c7ab6f ]
When SCTP makes INIT or INIT_ACK packet the total chunk length
can exceed SCTP_MAX_CHUNK_LEN which leads to kernel panic when
transmitting these packets, e.g. the crash on sending INIT_ACK:
[ 597.804948] skbuff: skb_over_panic: text:00000000ffae06e4 len:120168
put:120156 head:000000007aa47635 data:00000000d991c2de
tail:0x1d640 end:0xfec0 dev:<NULL>
...
[ 597.976970] ------------[ cut here ]------------
[ 598.033408] kernel BUG at net/core/skbuff.c:104!
[ 600.314841] Call Trace:
[ 600.345829] <IRQ>
[ 600.371639] ? sctp_packet_transmit+0x2095/0x26d0 [sctp]
[ 600.436934] skb_put+0x16c/0x200
[ 600.477295] sctp_packet_transmit+0x2095/0x26d0 [sctp]
[ 600.540630] ? sctp_packet_config+0x890/0x890 [sctp]
[ 600.601781] ? __sctp_packet_append_chunk+0x3b4/0xd00 [sctp]
[ 600.671356] ? sctp_cmp_addr_exact+0x3f/0x90 [sctp]
[ 600.731482] sctp_outq_flush+0x663/0x30d0 [sctp]
[ 600.788565] ? sctp_make_init+0xbf0/0xbf0 [sctp]
[ 600.845555] ? sctp_check_transmitted+0x18f0/0x18f0 [sctp]
[ 600.912945] ? sctp_outq_tail+0x631/0x9d0 [sctp]
[ 600.969936] sctp_cmd_interpreter.isra.22+0x3be1/0x5cb0 [sctp]
[ 601.041593] ? sctp_sf_do_5_1B_init+0x85f/0xc30 [sctp]
[ 601.104837] ? sctp_generate_t1_cookie_event+0x20/0x20 [sctp]
[ 601.175436] ? sctp_eat_data+0x1710/0x1710 [sctp]
[ 601.233575] sctp_do_sm+0x182/0x560 [sctp]
[ 601.284328] ? sctp_has_association+0x70/0x70 [sctp]
[ 601.345586] ? sctp_rcv+0xef4/0x32f0 [sctp]
[ 601.397478] ? sctp6_rcv+0xa/0x20 [sctp]
...
Here the chunk size for INIT_ACK packet becomes too big, mostly
because of the state cookie (INIT packet has large size with
many address parameters), plus additional server parameters.
Later this chunk causes the panic in skb_put_data():
skb_packet_transmit()
sctp_packet_pack()
skb_put_data(nskb, chunk->skb->data, chunk->skb->len);
'nskb' (head skb) was previously allocated with packet->size
from u16 'chunk->chunk_hdr->length'.
As suggested by Marcelo we should check the chunk's length in
_sctp_make_chunk() before trying to allocate skb for it and
discard a chunk if its size bigger than SCTP_MAX_CHUNK_LEN.
Signed-off-by: Alexey Kodanev <alexey.kodanev@oracle.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leinter@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit d22ffb5a71 ]
If multiple IPA commands are build & sent out concurrently,
fill_ipacmd_header() may assign a seqno value to a command that's
different from what send_control_data() later assigns to this command's
reply.
This is due to other commands passing through send_control_data(),
and incrementing card->seqno.ipa along the way.
So one IPA command has no reply that's waiting for its seqno, while some
other IPA command has multiple reply objects waiting for it.
Only one of those waiting replies wins, and the other(s) times out and
triggers a recovery via send_ipa_cmd().
Fix this by making sure that the same seqno value is assigned to
a command and its reply object.
Do so immediately before submitting the command & while holding the
irq_pending "lock", to produce nicely ascending seqnos.
As a side effect, *all* IPA commands now use a reply object that's
waiting for its actual seqno. Previously, early IPA commands that were
submitted while the card was still DOWN used the "catch-all" IDX seqno.
Signed-off-by: Julian Wiedmann <jwi@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit c5c48c58b2 ]
Current code ("qeth_l3_ip_from_hash()") matches a queried address object
against objects in the IP table by IP address, Mask/Prefix Length and
MAC address ("qeth_l3_ipaddrs_is_equal()"). But what callers actually
require is either
a) "is this IP address registered" (ie. match by IP address only),
before adding a new address.
b) or "is this address object registered" (ie. match all relevant
attributes), before deleting an address.
Right now
1. the ADD path is too strict in its lookup, and eg. doesn't detect
conflicts between an existing NORMAL address and a new VIPA address
(because the NORMAL address will have mask != 0, while VIPA has
a mask == 0),
2. the DELETE path is not strict enough, and eg. allows del_rxip() to
delete a VIPA address as long as the IP address matches.
Fix all this by adding helpers (_addr_match_ip() and _addr_match_all())
that do the appropriate checking.
Note that the ADD path for NORMAL addresses is special, as qeth keeps
track of how many times such an address is in use (and there is no
immediate way of returning errors to the caller). So when a requested
NORMAL address _fully_ matches an existing one, it's not considered a
conflict and we merely increment the refcount.
Fixes: 5f78e29cee ("qeth: optimize IP handling in rx_mode callback")
Signed-off-by: Julian Wiedmann <jwi@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 14d066c353 ]
Registering an IPv4 address with the HW takes quite a while, so we
temporarily drop the ip_htable lock. Any concurrent add/remove of the
same IP adjusts the IP's use count, and (on remove) is then blocked by
addr->in_progress.
After the register call has completed, we check the use count for
concurrently attempted add/remove calls - and possibly straight-away
deregister the IP again. This happens via l3_delete_ip(), which
1) looks up the queried IP in the htable (getting a reference to the
*same* queried object),
2) deregisters the IP from the HW, and
3) frees the IP object.
The caller in l3_add_ip() then does a second free on the same object.
For this case, skip all the extra checks and lookups in l3_delete_ip()
and just deregister & free the IP object ourselves.
Fixes: 5f78e29cee ("qeth: optimize IP handling in rx_mode callback")
Signed-off-by: Julian Wiedmann <jwi@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 98d823ab1f ]
If the HW is not reachable, then none of the IPs in qeth's internal
table has been registered with the HW yet. So when deleting such an IP,
there's no need to stage it for deregistration - just drop it from
the table.
This fixes the "add-delete-add" scenario on an offline card, where the
the second "add" merely increments the IP's use count. But as the IP is
still set to DISP_ADDR_DELETE from the previous "delete" step,
l3_recover_ip() won't register it with the HW when the card goes online.
Fixes: 5f78e29cee ("qeth: optimize IP handling in rx_mode callback")
Signed-off-by: Julian Wiedmann <jwi@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 12472af896 ]
qeth_get_elements_for_range() doesn't know how to handle a 0-length
range (ie. start == end), and returns 1 when it should return 0.
Such ranges occur on TSO skbs, where the L2/L3/L4 headers (and thus all
of the skb's linear data) are skipped when mapping the skb into regular
buffer elements.
This overestimation may cause several performance-related issues:
1. sub-optimal IO buffer selection, where the next buffer gets selected
even though the skb would actually still fit into the current buffer.
2. forced linearization, if the element count for a non-linear skb
exceeds QETH_MAX_BUFFER_ELEMENTS.
Rather than modifying qeth_get_elements_for_range() and adding overhead
to every caller, fix up those callers that are in risk of passing a
0-length range.
Fixes: 2863c61334 ("qeth: refactor calculation of SBALE count")
Signed-off-by: Julian Wiedmann <jwi@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 1c5b2216fb ]
send_control_data() applies some special handling to SETIP v4 IPA
commands. But current code parses *all* command types for the SETIP
command code. Limit the command code check to IPA commands.
Fixes: 5b54e16f1a ("qeth: do not spin for SETIP ip assist command")
Signed-off-by: Julian Wiedmann <jwi@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 89271c65ed ]
For a memory range/skb where the last byte falls onto a page boundary
(ie. 'end' is of the form xxx...xxx001), the PFN_UP() part of the
calculation currently doesn't round up to the next PFN due to an
off-by-one error.
Thus qeth believes that the skb occupies one page less than it
actually does, and may select a IO buffer that doesn't have enough spare
buffer elements to fit all of the skb's data.
HW detects this as a malformed buffer descriptor, and raises an
exception which then triggers device recovery.
Fixes: 2863c61334 ("qeth: refactor calculation of SBALE count")
Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com>
Signed-off-by: Julian Wiedmann <jwi@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 957d761cf9 ]
When going through the bind address list in sctp_v6_get_dst() and
the previously found address is better ('matchlen > bmatchlen'),
the code continues to the next iteration without releasing currently
held destination.
Fix it by releasing 'bdst' before continue to the next iteration, and
instead of introducing one more '!IS_ERR(bdst)' check for dst_release(),
move the already existed one right after ip6_dst_lookup_flow(), i.e. we
shouldn't proceed further if we get an error for the route lookup.
Fixes: dbc2b5e9a0 ("sctp: fix src address selection if using secondary addresses for ipv6")
Signed-off-by: Alexey Kodanev <alexey.kodanev@oracle.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 350c9f484b ]
BBR uses tcp_tso_autosize() in an attempt to probe what would be the
burst sizes and to adjust cwnd in bbr_target_cwnd() with following
gold formula :
/* Allow enough full-sized skbs in flight to utilize end systems. */
cwnd += 3 * bbr->tso_segs_goal;
But GSO can be lacking or be constrained to very small
units (ip link set dev ... gso_max_segs 2)
What we really want is to have enough packets in flight so that both
GSO and GRO are efficient.
So in the case GSO is off or downgraded, we still want to have the same
number of packets in flight as if GSO/TSO was fully operational, so
that GRO can hopefully be working efficiently.
To fix this issue, we make tcp_tso_autosize() unaware of
sk->sk_gso_max_segs
Only tcp_tso_segs() has to enforce the gso_max_segs limit.
Tested:
ethtool -K eth0 tso off gso off
tc qd replace dev eth0 root pfifo_fast
Before patch:
for f in {1..5}; do ./super_netperf 1 -H lpaa24 -- -K bbr; done
691 (ss -temoi shows cwnd is stuck around 6 )
667
651
631
517
After patch :
# for f in {1..5}; do ./super_netperf 1 -H lpaa24 -- -K bbr; done
1733 (ss -temoi shows cwnd is around 386 )
1778
1746
1781
1718
Fixes: 0f8782ea14 ("tcp_bbr: add BBR congestion control")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 93c62c45ed ]
All the kernel_sendmsg() calls in rxrpc_send_data_packet() need to send
both parts of the iov[] buffer, but one of them does not. Fix it so that
it does.
Without this, short IPv6 rxrpc DATA packets may be seen that have the rxrpc
header included, but no payload.
Fixes: 5a924b8951 ("rxrpc: Don't store the rxrpc header in the Tx queue sk_buffs")
Reported-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 08f5138512 ]
This condition wasn't adjusted when PHY_IGNORE_INTERRUPT (-2) was added
long ago. In case of PHY_IGNORE_INTERRUPT the MAC interrupt indicates
also PHY state changes and we should do what the symbol says.
Fixes: 84a527a41f ("net: phylib: fix interrupts re-enablement in phy_start")
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 0a8a1bf17e ]
Until now, we assumed that in case of error when adding FDB entries, the
write operation will fail, but this is not the case. Instead, we need to
check that the number of entries reported in the response is equal to
the number of entries specified in the request.
Fixes: 56ade8fe3f ("mlxsw: spectrum: Add initial support for Spectrum ASIC")
Reported-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Shalom Toledo <shalomt@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 4a31a6b19f ]
Fix dst reference count leak in sctp_v4_get_dst() introduced in commit
410f03831 ("sctp: add routing output fallback"):
When walking the address_list, successive ip_route_output_key() calls
may return the same rt->dst with the reference incremented on each call.
The code would not decrement the dst refcount when the dst pointer was
identical from the previous iteration, causing the dst refcnt leak.
Testcase:
ip netns add TEST
ip netns exec TEST ip link set lo up
ip link add dummy0 type dummy
ip link add dummy1 type dummy
ip link add dummy2 type dummy
ip link set dev dummy0 netns TEST
ip link set dev dummy1 netns TEST
ip link set dev dummy2 netns TEST
ip netns exec TEST ip addr add 192.168.1.1/24 dev dummy0
ip netns exec TEST ip link set dummy0 up
ip netns exec TEST ip addr add 192.168.1.2/24 dev dummy1
ip netns exec TEST ip link set dummy1 up
ip netns exec TEST ip addr add 192.168.1.3/24 dev dummy2
ip netns exec TEST ip link set dummy2 up
ip netns exec TEST sctp_test -H 192.168.1.2 -P 20002 -h 192.168.1.1 -p 20000 -s -B 192.168.1.3
ip netns del TEST
In 4.4 and 4.9 kernels this results to:
[ 354.179591] unregister_netdevice: waiting for lo to become free. Usage count = 1
[ 364.419674] unregister_netdevice: waiting for lo to become free. Usage count = 1
[ 374.663664] unregister_netdevice: waiting for lo to become free. Usage count = 1
[ 384.903717] unregister_netdevice: waiting for lo to become free. Usage count = 1
[ 395.143724] unregister_netdevice: waiting for lo to become free. Usage count = 1
[ 405.383645] unregister_netdevice: waiting for lo to become free. Usage count = 1
...
Fixes: 410f03831 ("sctp: add routing output fallback")
Fixes: 0ca50d12f ("sctp: fix src address selection if using secondary addresses")
Signed-off-by: Tommi Rantala <tommi.t.rantala@nokia.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 15f35d49c9 ]
Since UDP-Lite is always using checksum, the following path is
triggered when calculating pseudo header for it:
udp4_csum_init() or udp6_csum_init()
skb_checksum_init_zero_check()
__skb_checksum_validate_complete()
The problem can appear if skb->len is less than CHECKSUM_BREAK. In
this particular case __skb_checksum_validate_complete() also invokes
__skb_checksum_complete(skb). If UDP-Lite is using partial checksum
that covers only part of a packet, the function will return bad
checksum and the packet will be dropped.
It can be fixed if we skip skb_checksum_init_zero_check() and only
set the required pseudo header checksum for UDP-Lite with partial
checksum before udp4_csum_init()/udp6_csum_init() functions return.
Fixes: ed70fcfcee ("net: Call skb_checksum_init in IPv4")
Fixes: e4f45b7f40 ("net: Call skb_checksum_init in IPv6")
Signed-off-by: Alexey Kodanev <alexey.kodanev@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 77f840e3e5 ]
PPP units don't hold any reference on the channels connected to it.
It is the channel's responsibility to ensure that it disconnects from
its unit before being destroyed.
In practice, this is ensured by ppp_unregister_channel() disconnecting
the channel from the unit before dropping a reference on the channel.
However, it is possible for an unregistered channel to connect to a PPP
unit: register a channel with ppp_register_net_channel(), attach a
/dev/ppp file to it with ioctl(PPPIOCATTCHAN), unregister the channel
with ppp_unregister_channel() and finally connect the /dev/ppp file to
a PPP unit with ioctl(PPPIOCCONNECT).
Once in this situation, the channel is only held by the /dev/ppp file,
which can be released at anytime and free the channel without letting
the parent PPP unit know. Then the ppp structure ends up with dangling
pointers in its ->channels list.
Prevent this scenario by forbidding unregistered channels from
connecting to PPP units. This maintains the code logic by keeping
ppp_unregister_channel() responsible from disconnecting the channel if
necessary and avoids modification on the reference counting mechanism.
This issue seems to predate git history (successfully reproduced on
Linux 2.6.26 and earlier PPP commits are unrelated).
Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>