Logical block size has type unsigned short. That means that it can be at
most 32768. However, there are architectures that can run with 64k pages
(for example arm64) and on these architectures, it may be possible to
create block devices with 64k block size.
For exmaple (run this on an architecture with 64k pages):
Mount will fail with this error because it tries to read the superblock using 2-sector
access:
device-mapper: writecache: I/O is not aligned, sector 2, size 1024, block size 65536
EXT4-fs (dm-0): unable to read superblock
This patch changes the logical block size from unsigned short to unsigned
int to avoid the overflow.
Cc: stable@vger.kernel.org
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
req->result is cleared when io_issue_sqe() calls io_read/write_pre()
routines. Those routines however are not called when the sqe
argument is NULL, which is the case when io_issue_sqe() is called from
io_wq_submit_work(). io_issue_sqe() may then examine a stale result if
a polled request had previously failed with -EAGAIN:
if (ctx->flags & IORING_SETUP_IOPOLL) {
if (req->result == -EAGAIN)
return -EAGAIN;
io_iopoll_req_issued(req);
}
and in turn cause a subsequently completed request to be re-issued in
io_wq_submit_work().
Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
IOCTLs causing firmware fault may end up in failed controller resets and
finally killing the adapter.
This patch fixes this problem as stated below:
In OCR sequence, driver will attempt refiring pended IOCTLs upto two times.
If first two attempts fail, then in third attempt driver will return pended
IOCTLs with EBUSY status to application. These changes are done to ensure
if any of pended IOCTLs is causing firmware fault and resulting into OCR
failure, then in last attempt of OCR driver will refrain firing it to
firmware and saving adapter from being killed due to faulty IOCTL.
Link: https://lore.kernel.org/r/1579000882-20246-10-git-send-email-anand.lodnoor@broadcom.com
Signed-off-by: Sumit Saxena <sumit.saxena@broadcom.com>
Signed-off-by: Shivasharan S <shivasharan.srikanteshwara@broadcom.com>
Signed-off-by: Chandrakanth Patil <chandrakanth.patil@broadcom.com>
Signed-off-by: Anand Lodnoor <anand.lodnoor@broadcom.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Tom Hatskevich reported that we look up "iocp" then, in the called
functions we do a second copy_from_user() and look it up again.
The problem that could cause is:
drivers/message/fusion/mptctl.c
674 /* All of these commands require an interrupt or
675 * are unknown/illegal.
676 */
677 if ((ret = mptctl_syscall_down(iocp, nonblock)) != 0)
^^^^
We take this lock.
678 return ret;
679
680 if (cmd == MPTFWDOWNLOAD)
681 ret = mptctl_fw_download(arg);
^^^
Then the user memory changes and we look up "iocp" again but a different
one so now we are holding the incorrect lock and have a race condition.
682 else if (cmd == MPTCOMMAND)
683 ret = mptctl_mpt_command(arg);
The security impact of this bug is not as bad as it could have been
because these operations are all privileged and root already has
enormous destructive power. But it's still worth fixing.
This patch passes the "iocp" pointer to the functions to avoid the
second lookup. That deletes 100 lines of code from the driver so
it's a nice clean up as well.
Link: https://lore.kernel.org/r/20200114123414.GA7957@kadam
Reported-by: Tom Hatskevich <tom2001tom.23@gmail.com>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Commit 0ed8810276 ("scsi: storvsc: setup 1:1 mapping between hardware
queue and CPU queue") introduced a regression for disks attached to
IDE. For these disks the host VSP only offers one VMBUS channel. Setting
multiple queues can overload the VMBUS channel and result in performance
drop for high queue depth workload on system with large number of CPUs.
Fix it by leaving the number of hardware queues to 1 (default value) for
IDE disks.
Fixes: 0ed8810276 ("scsi: storvsc: setup 1:1 mapping between hardware queue and CPU queue")
Link: https://lore.kernel.org/r/1578960516-108228-1-git-send-email-longli@linuxonhyperv.com
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Long Li <longli@microsoft.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
gcc -O3 warns that some local variables are not properly initialized:
drivers/scsi/fnic/vnic_dev.c: In function 'fnic_dev_hang_notify':
drivers/scsi/fnic/vnic_dev.c:511:16: error: 'a0' is used uninitialized in this function [-Werror=uninitialized]
vdev->args[0] = *a0;
~~~~~~~~~~~~~~^~~~~
drivers/scsi/fnic/vnic_dev.c:691:6: note: 'a0' was declared here
u64 a0, a1;
^~
drivers/scsi/fnic/vnic_dev.c:512:16: error: 'a1' is used uninitialized in this function [-Werror=uninitialized]
vdev->args[1] = *a1;
~~~~~~~~~~~~~~^~~~~
drivers/scsi/fnic/vnic_dev.c:691:10: note: 'a1' was declared here
u64 a0, a1;
^~
drivers/scsi/fnic/vnic_dev.c: In function 'fnic_dev_mac_addr':
drivers/scsi/fnic/vnic_dev.c:512:16: error: 'a1' is used uninitialized in this function [-Werror=uninitialized]
vdev->args[1] = *a1;
~~~~~~~~~~~~~~^~~~~
drivers/scsi/fnic/vnic_dev.c:698:10: note: 'a1' was declared here
u64 a0, a1;
^~
Apparently the code relies on the local variables occupying adjacent memory
locations in the same order, but this is of course not guaranteed.
Use an array of two u64 variables where needed to make it work correctly.
I suspect there is also an endianness bug here, but have not digged in deep
enough to be sure.
Fixes: 5df6d737dd ("[SCSI] fnic: Add new Cisco PCI-Express FCoE HBA")
Fixes: mmtom ("init/Kconfig: enable -O3 for all arches")
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20200107201602.4096790-1-arnd@arndb.de
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
If the transport cannot be registered, the session/connection creation
needs to be failed early to let the initiator know. Otherwise, the system
will have an outstanding connection that cannot be used nor removed by
open-iscsi. The result is similar to the error below, triggered by
injecting a failure in the transport's registration path.
openiscsi reports success:
root@debian-vm:~# iscsiadm -m node -T iqn:lun1 -p 127.0.0.1 -l
Logging in to [iface: default, target: iqn:lun1, portal: 127.0.0.1,3260]
Login to [iface: default, target: iqn:lun1, portal:127.0.0.1,3260] successful.
But cannot remove the session afterwards, since the kernel is in an
inconsistent state.
root@debian-vm:~# iscsiadm -m node -T iqn:lun1 -p 127.0.0.1 -u
iscsiadm: No matching sessions found
Link: https://lore.kernel.org/r/20200106185817.640331-4-krisman@collabora.com
Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
attribute_container_device_trigger invokes callbacks that may fail for one
or more classdevs, for instance, the transport_add_class_device callback,
called during transport creation, does memory allocation. This
information, though, is not propagated to upper layers, and any driver
using the attribute_container_device_trigger API will not know whether any,
some, or all callbacks succeeded.
This patch implements a safe version of this dispatcher, to either succeed
all the callbacks or revert to the original state.
Link: https://lore.kernel.org/r/20200106185817.640331-2-krisman@collabora.com
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Add ability to specify a list of test name substrings for selecting which
tests to run. So now -t is accepting a comma-separated list of strings,
similarly to how -n accepts a comma-separated list of test numbers.
Additionally, add ability to blacklist tests by name. Blacklist takes
precedence over whitelist. Blacklisting is important for cases where it's
known that some tests can't pass (e.g., due to perf hardware events that are
not available within VM). This is going to be used for libbpf testing in
Travis CI in its Github repo.
Example runs with just whitelist and whitelist + blacklist:
$ sudo ./test_progs -tattach,core/existence
#1 attach_probe:OK
#6 cgroup_attach_autodetach:OK
#7 cgroup_attach_multi:OK
#8 cgroup_attach_override:OK
#9 core_extern:OK
#10/44 existence:OK
#10/45 existence___minimal:OK
#10/46 existence__err_int_sz:OK
#10/47 existence__err_int_type:OK
#10/48 existence__err_int_kind:OK
#10/49 existence__err_arr_kind:OK
#10/50 existence__err_arr_value_type:OK
#10/51 existence__err_struct_type:OK
#10 core_reloc:OK
#19 flow_dissector_reattach:OK
#60 tp_attach_query:OK
Summary: 8/8 PASSED, 0 SKIPPED, 0 FAILED
$ sudo ./test_progs -tattach,core/existence -bcgroup,flow/arr
#1 attach_probe:OK
#9 core_extern:OK
#10/44 existence:OK
#10/45 existence___minimal:OK
#10/46 existence__err_int_sz:OK
#10/47 existence__err_int_type:OK
#10/48 existence__err_int_kind:OK
#10/51 existence__err_struct_type:OK
#10 core_reloc:OK
#60 tp_attach_query:OK
Summary: 4/6 PASSED, 0 SKIPPED, 0 FAILED
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Cc: Julia Kartseva <hex@fb.com>
Link: https://lore.kernel.org/bpf/20200116005549.3644118-1-andriin@fb.com
The code in secondary_park is currently placed in the .init section. The
kernel reclaims and clears this code when it finishes booting. That
causes the cores parked in it to go to somewhere unpredictable, so we
move this function out of init to make sure the cores stay looping there.
The instruction bgeu a0, t0, .Lsecondary_park may have "a relocation
truncated to fit" issue during linking time. It is because that sections
are too far to jump. Let's use tail to jump to the .Lsecondary_park.
Signed-off-by: Greentime Hu <greentime.hu@sifive.com>
Reviewed-by: Anup Patel <anup.patel@sifive.com>
Cc: Andreas Schwab <schwab@suse.de>
Cc: stable@vger.kernel.org
Fixes: 76d2a0493a ("RISC-V: Init and Halt Code")
Signed-off-by: Paul Walmsley <paul.walmsley@sifive.com>
This enables bcm2711's PCIe bus, which is hardwired to a VIA
Technologies XHCI USB 3.0 controller.
Signed-off-by: Nicolas Saenz Julienne <nsaenzjulienne@suse.de>
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
With the introduction of the Raspberry Pi 4 we were forced to explicitly
configure CMA's location, since arm64 defaults it into the ZONE_DMA32
memory area, which is not good enough to perform DMA operations on that
device. To bypass this limitation a dedicated CMA DT node was created,
explicitly indicating the acceptable memory range and size.
That said, compatibility between boards is a must on the Raspberry Pi
ecosystem so this creates a common CMA DT node so as for DT overlays to
be able to update CMA's properties regardless of the board being used.
Signed-off-by: Nicolas Saenz Julienne <nsaenzjulienne@suse.de>
Reviewed-by: Phil Elwell <phil@raspberrypi.org>
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Martin Lau says:
====================
When a map is storing a kernel's struct, its
map_info->btf_vmlinux_value_type_id is set. The first map type
supporting it is BPF_MAP_TYPE_STRUCT_OPS.
This series adds support to dump this kind of map with BTF.
The first two patches are bug fixes which are only applicable to
bpf-next.
Please see individual patches for details.
v3:
- Remove unnecessary #include "libbpf_internal.h" from patch 5
v2:
- Expose bpf_find_kernel_btf() as a LIBBPF_API in patch 3 (Andrii)
- Cache btf_vmlinux in bpftool/map.c (Andrii)
====================
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
This patch makes bpftool support dumping a map's value properly
when the map's value type is a type of the running kernel's btf.
(i.e. map_info.btf_vmlinux_value_type_id is set instead of
map_info.btf_value_type_id). The first usecase is for the
BPF_MAP_TYPE_STRUCT_OPS.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200115230044.1103008-1-kafai@fb.com
This patch adds BPF_MAP_TYPE_STRUCT_OPS to "struct_ops" name mapping
so that "bpftool map show" can print the "struct_ops" map type
properly.
[root@arch-fb-vm1 bpf]# ~/devshare/fb-kernel/linux/tools/bpf/bpftool/bpftool map show id 8
8: struct_ops name dctcp flags 0x0
key 4B value 256B max_entries 1 memlock 4096B
btf_id 7
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200115230037.1102674-1-kafai@fb.com
This patch exposes bpf_find_kernel_btf() as a LIBBPF_API.
It will be used in 'bpftool map dump' in a following patch
to dump a map with btf_vmlinux_value_type_id set.
bpf_find_kernel_btf() is renamed to libbpf_find_kernel_btf()
and moved to btf.c. As <linux/kernel.h> is included,
some of the max/min type casting needs to be fixed.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200115230031.1102305-1-kafai@fb.com
The btf availability check is only done for plain text output.
It causes the whole BTF output went missing when json_output
is used.
This patch simplifies the logic a little by avoiding passing "int btf" to
map_dump().
For plain text output, the btf_wtr is only created when the map has
BTF (i.e. info->btf_id != 0). The nullness of "json_writer_t *wtr"
in map_dump() alone can decide if dumping BTF output is needed.
As long as wtr is not NULL, map_dump() will print out the BTF-described
data whenever a map has BTF available (i.e. info->btf_id != 0)
regardless of json or plain-text output.
In do_dump(), the "int btf" is also renamed to "int do_plain_btf".
Fixes: 99f9863a0c ("bpftool: Match maps by name")
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Cc: Paul Chaignon <paul.chaignon@orange.com>
Link: https://lore.kernel.org/bpf/20200115230025.1101828-1-kafai@fb.com
When testing a map has btf or not, maps_have_btf() tests it by actually
getting a btf_fd from sys_bpf(BPF_BTF_GET_FD_BY_ID). However, it
forgot to btf__free() it.
In maps_have_btf() stage, there is no need to test it by really
calling sys_bpf(BPF_BTF_GET_FD_BY_ID). Testing non zero
info.btf_id is good enough.
Also, the err_close case is unnecessary, and also causes double
close() because the calling func do_dump() will close() all fds again.
Fixes: 99f9863a0c ("bpftool: Match maps by name")
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Cc: Paul Chaignon <paul.chaignon@orange.com>
Link: https://lore.kernel.org/bpf/20200115230019.1101352-1-kafai@fb.com
QEMU running in a stubdom needs to be able to set INTX_DISABLE, and the
MSI(-X) enable flags in the PCI config space. This adds an attribute
'allow_interrupt_control' which when set for a PCI device allows writes
to this flag(s). The toolstack will need to set this for stubdoms.
When enabled, guest (stubdomain) will be allowed to set relevant enable
flags, but only one at a time - i.e. it refuses to enable more than one
of INTx, MSI, MSI-X at a time.
This functionality is needed only for config space access done by device
model (stubdomain) serving a HVM with the actual PCI device. It is not
necessary and unsafe to enable direct access to those bits for PV domain
with the device attached. For PV domains, there are separate protocol
messages (XEN_PCI_OP_{enable,disable}_{msi,msix}) for this purpose.
Those ops in addition to setting enable bits, also configure MSI(-X) in
dom0 kernel - which is undesirable for PCI passthrough to HVM guests.
This should not introduce any new security issues since a malicious
guest (or stubdom) can already generate MSIs through other ways, see
[1] page 8. Additionally, when qemu runs in dom0, it already have direct
access to those bits.
This is the second iteration of this feature. First was proposed as a
direct Xen interface through a new hypercall, but ultimately it was
rejected by the maintainer, because of mixing pciback and hypercalls for
PCI config space access isn't a good design. Full discussion at [2].
[1]: https://invisiblethingslab.com/resources/2011/Software%20Attacks%20on%20Intel%20VT-d.pdf
[2]: https://xen.markmail.org/thread/smpgpws4umdzizze
[part of the commit message and sysfs handling]
Signed-off-by: Simon Gaiser <simon@invisiblethingslab.com>
[the rest]
Signed-off-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
[boris: A few small changes suggested by Roger, some formatting changes]
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
John Fastabend says:
====================
To date our usage of sockmap/tls has been fairly simple, the BPF programs
did only well-defined pop, push, pull and apply/cork operations.
Now that we started to push more complex programs into sockmap we uncovered
a series of issues addressed here. Further OpenSSL3.0 version should be
released soon with kTLS support so its important to get any remaining
issues on BPF and kTLS support resolved.
Additionally, I have a patch under development to allow sockmap to be
enabled/disabled at runtime for Cilium endpoints. This allows us to stress
the map insert/delete with kTLS more than previously where Cilium only
added the socket to the map when it entered ESTABLISHED state and never
touched it from the control path side again relying on the sockets own
close() hook to remove it.
To test I have a set of test cases in test_sockmap.c that expose these
issues. Once we get fixes here merged and in bpf-next I'll submit the
tests to bpf-next tree to ensure we don't regress again. Also I've run
these patches in the Cilium CI with OpenSSL (master branch) this will
run tools such as netperf, ab, wrk2, curl, etc. to get a broad set of
testing.
I'm aware of two more issues that we are working to resolve in another
couple (probably two) patches. First we see an auth tag corruption in
kTLS when sending small 1byte chunks under stress. I've not pinned this
down yet. But, guessing because its under 1B stress tests it must be
some error path being triggered. And second we need to ensure BPF RX
programs are not skipped when kTLS ULP is loaded. This breaks some of the
sockmap selftests when running with kTLS. I'll send a follow up for this.
v2: I dropped a patch that added !0 size check in tls_push_record
this originated from a panic I caught awhile ago with a trace
in the crypto stack. But I can not reproduce it anymore so will
dig into that and send another patch later if needed. Anyways
after a bit of thought it would be nicer if tls/crypto/bpf didn't
require special case handling for the !0 size.
====================
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
When user returns SK_DROP we need to reset the number of copied bytes
to indicate to the user the bytes were dropped and not sent. If we
don't reset the copied arg sendmsg will return as if those bytes were
copied giving the user a positive return value.
This works as expected today except in the case where the user also
pops bytes. In the pop case the sg.size is reduced but we don't correctly
account for this when copied bytes is reset. The popped bytes are not
accounted for and we return a small positive value potentially confusing
the user.
The reason this happens is due to a typo where we do the wrong comparison
when accounting for pop bytes. In this fix notice the if/else is not
needed and that we have a similar problem if we push data except its not
visible to the user because if delta is larger the sg.size we return a
negative value so it appears as an error regardless.
Fixes: 7246d8ed4d ("bpf: helper to pop data from messages")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/bpf/20200111061206.8028-9-john.fastabend@gmail.com
Its possible through a set of push, pop, apply helper calls to construct
a skmsg, which is just a ring of scatterlist elements, with the start
value larger than the end value. For example,
end start
|_0_|_1_| ... |_n_|_n+1_|
Where end points at 1 and start points and n so that valid elements is
the set {n, n+1, 0, 1}.
Currently, because we don't build the correct chain only {n, n+1} will
be sent. This adds a check and sg_chain call to correctly submit the
above to the crypto and tls send path.
Fixes: d3b18ad31f ("tls: add bpf support to sk_msg handling")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/bpf/20200111061206.8028-8-john.fastabend@gmail.com
It is possible to build a plaintext buffer using push helper that is larger
than the allocated encrypt buffer. When this record is pushed to crypto
layers this can result in a NULL pointer dereference because the crypto
API expects the encrypt buffer is large enough to fit the plaintext
buffer. Kernel splat below.
To resolve catch the cases this can happen and split the buffer into two
records to send individually. Unfortunately, there is still one case to
handle where the split creates a zero sized buffer. In this case we merge
the buffers and unmark the split. This happens when apply is zero and user
pushed data beyond encrypt buffer. This fixes the original case as well
because the split allocated an encrypt buffer larger than the plaintext
buffer and the merge simply moves the pointers around so we now have
a reference to the new (larger) encrypt buffer.
Perhaps its not ideal but it seems the best solution for a fixes branch
and avoids handling these two cases, (a) apply that needs split and (b)
non apply case. The are edge cases anyways so optimizing them seems not
necessary unless someone wants later in next branches.
[ 306.719107] BUG: kernel NULL pointer dereference, address: 0000000000000008
[...]
[ 306.747260] RIP: 0010:scatterwalk_copychunks+0x12f/0x1b0
[...]
[ 306.770350] Call Trace:
[ 306.770956] scatterwalk_map_and_copy+0x6c/0x80
[ 306.772026] gcm_enc_copy_hash+0x4b/0x50
[ 306.772925] gcm_hash_crypt_remain_continue+0xef/0x110
[ 306.774138] gcm_hash_crypt_continue+0xa1/0xb0
[ 306.775103] ? gcm_hash_crypt_continue+0xa1/0xb0
[ 306.776103] gcm_hash_assoc_remain_continue+0x94/0xa0
[ 306.777170] gcm_hash_assoc_continue+0x9d/0xb0
[ 306.778239] gcm_hash_init_continue+0x8f/0xa0
[ 306.779121] gcm_hash+0x73/0x80
[ 306.779762] gcm_encrypt_continue+0x6d/0x80
[ 306.780582] crypto_gcm_encrypt+0xcb/0xe0
[ 306.781474] crypto_aead_encrypt+0x1f/0x30
[ 306.782353] tls_push_record+0x3b9/0xb20 [tls]
[ 306.783314] ? sk_psock_msg_verdict+0x199/0x300
[ 306.784287] bpf_exec_tx_verdict+0x3f2/0x680 [tls]
[ 306.785357] tls_sw_sendmsg+0x4a3/0x6a0 [tls]
test_sockmap test signature to trigger bug,
[TEST]: (1, 1, 1, sendmsg, pass,redir,start 1,end 2,pop (1,2),ktls,):
Fixes: d3b18ad31f ("tls: add bpf support to sk_msg handling")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/bpf/20200111061206.8028-7-john.fastabend@gmail.com
Leaving an incorrect end mark in place when passing to crypto
layer will cause crypto layer to stop processing data before
all data is encrypted. To fix clear the end mark on push
data instead of expecting users of the helper to clear the
mark value after the fact.
This happens when we push data into the middle of a skmsg and
have room for it so we don't do a set of copies that already
clear the end flag.
Fixes: 6fff607e2f ("bpf: sk_msg program helper bpf_msg_push_data")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Song Liu <songliubraving@fb.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/bpf/20200111061206.8028-6-john.fastabend@gmail.com
In the push, pull, and pop helpers operating on skmsg objects to make
data writable or insert/remove data we use this bounds check to ensure
specified data is valid,
/* Bounds checks: start and pop must be inside message */
if (start >= offset + l || last >= msg->sg.size)
return -EINVAL;
The problem here is offset has already included the length of the
current element the 'l' above. So start could be past the end of
the scatterlist element in the case where start also points into an
offset on the last skmsg element.
To fix do the accounting slightly different by adding the length of
the previous entry to offset at the start of the iteration. And
ensure its initialized to zero so that the first iteration does
nothing.
Fixes: 604326b41a ("bpf, sockmap: convert to generic sk_msg interface")
Fixes: 6fff607e2f ("bpf: sk_msg program helper bpf_msg_push_data")
Fixes: 7246d8ed4d ("bpf: helper to pop data from messages")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Song Liu <songliubraving@fb.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/bpf/20200111061206.8028-5-john.fastabend@gmail.com
When sockmap sock with TLS enabled is removed we cleanup bpf/psock state
and call tcp_update_ulp() to push updates to TLS ULP on top. However, we
don't push the write_space callback up and instead simply overwrite the
op with the psock stored previous op. This may or may not be correct so
to ensure we don't overwrite the TLS write space hook pass this field to
the ULP and have it fixup the ctx.
This completes a previous fix that pushed the ops through to the ULP
but at the time missed doing this for write_space, presumably because
write_space TLS hook was added around the same time.
Fixes: 95fa145479 ("bpf: sockmap/tls, close can race with map free")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/bpf/20200111061206.8028-4-john.fastabend@gmail.com
The sock_map_free() and sock_hash_free() paths used to delete sockmap
and sockhash maps walk the maps and destroy psock and bpf state associated
with the socks in the map. When done the socks no longer have BPF programs
attached and will function normally. This can happen while the socks in
the map are still "live" meaning data may be sent/received during the walk.
Currently, though we don't take the sock_lock when the psock and bpf state
is removed through this path. Specifically, this means we can be writing
into the ops structure pointers such as sendmsg, sendpage, recvmsg, etc.
while they are also being called from the networking side. This is not
safe, we never used proper READ_ONCE/WRITE_ONCE semantics here if we
believed it was safe. Further its not clear to me its even a good idea
to try and do this on "live" sockets while networking side might also
be using the socket. Instead of trying to reason about using the socks
from both sides lets realize that every use case I'm aware of rarely
deletes maps, in fact kubernetes/Cilium case builds map at init and
never tears it down except on errors. So lets do the simple fix and
grab sock lock.
This patch wraps sock deletes from maps in sock lock and adds some
annotations so we catch any other cases easier.
Fixes: 604326b41a ("bpf, sockmap: convert to generic sk_msg interface")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Song Liu <songliubraving@fb.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/bpf/20200111061206.8028-3-john.fastabend@gmail.com