commit eee22187b5 upstream.
In do_writepages, if the value returned by ext4_writepages is "-ENOMEM"
and "wbc->sync_mode == WB_SYNC_ALL", retry until the condition is not met.
In __ext4_get_inode_loc, if the bh returned by sb_getblk is NULL,
the function returns -ENOMEM.
In __getblk_slow, if the return value of grow_buffers is less than 0,
the function returns NULL.
When the three processes are connected in series like the following stack,
an infinite loop may occur:
do_writepages <--- keep retrying
 ext4_writepages
  mpage_map_and_submit_extent
   mpage_map_one_extent
    ext4_map_blocks
     ext4_ext_map_blocks
      ext4_ext_handle_unwritten_extents
       ext4_ext_convert_to_initialized
        ext4_split_extent
         ext4_split_extent_at
          __ext4_ext_dirty
           __ext4_mark_inode_dirty
            ext4_reserve_inode_write
             ext4_get_inode_loc
              __ext4_get_inode_loc <--- return -ENOMEM
               sb_getblk
                __getblk_gfp
                 __getblk_slow <--- return NULL
                  grow_buffers
                   grow_dev_page <--- return -ENXIO
                    ret = (block < end_block) ? 1 : -ENXIO;
In this issue, bg_inode_table_hi is overwritten with an incorrect value.
As a result, `block < end_block` cannot be met in grow_dev_page.
Therefore, __ext4_get_inode_loc always returns '-ENOMEM' and do_writepages
keeps retrying. As a result, the writeback process is in the D state due
to an infinite loop.
Add a check on inode table block in the __ext4_get_inode_loc function by
referring to ext4_read_inode_bitmap to avoid this infinite loop.
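A minimal userspace sketch of why the added check breaks the loop. Every
name below (model_get_inode_loc, model_do_writepages, MODEL_EFSCORRUPTED)
is hypothetical and only models the control flow described above; the
error code returned by the real check is an assumption here, and the
retry loop is capped so the model terminates.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a "filesystem corrupted" error code. */
#define MODEL_EFSCORRUPTED 117

/* Toy model of the inode table block lookup. */
static int model_get_inode_loc(uint64_t block, uint64_t end_block,
			       bool validate)
{
	/* The added check: reject an inode table block that is out of range. */
	if (validate && block >= end_block)
		return -MODEL_EFSCORRUPTED;

	/* Without the check, the failed buffer lookup surfaces as -ENOMEM. */
	if (block >= end_block)
		return -ENOMEM;

	return 0;
}

/* Toy model of the retry loop: retry for as long as -ENOMEM comes back. */
static int model_do_writepages(uint64_t block, uint64_t end_block,
			       bool validate, int max_tries, int *tries)
{
	int ret;

	*tries = 0;
	do {
		ret = model_get_inode_loc(block, end_block, validate);
		(*tries)++;
	} while (ret == -ENOMEM && *tries < max_tries);

	return ret;
}
```

With validate set to false and an out-of-range block, the model uses up
all max_tries attempts; with validate set to true it stops after one.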
Cc: stable@kernel.org
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Link: https://lore.kernel.org/r/20220817132701.3015912-3-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Bug: 264629802
Bug: 264632463
Change-Id: Id3bb71336059cac33f16fca383e783add3a01295
Signed-off-by: Tudor Ambarus <tudor.ambarus@linaro.org>
commit 5c61795ea9 upstream.
This debug statement was never meant to go into the upstream release,
kill it off before it ends up in a release. It was just part of the
testing for the initial version of the patch.
Fixes: 2ec33a6c3c ("io_uring/rw: ensure kiocb_end_write() is always called")
Change-Id: Iee9f436c34cc137a7ab934aafa3aa0c584369418
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Bug: 268174392
(cherry picked from commit e699cce29a)
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
commit 2ec33a6c3c upstream.
A previous commit moved the notifications and end-write handling, but
it is now missing a few spots where we also want to call both of those.
Without that, we can potentially be missing file notifications, and
more importantly, have an imbalance in the super_block writers sem
accounting.
Fixes: b000145e99 ("io_uring/rw: defer fsnotify calls to task context")
Reported-by: Dave Chinner <david@fromorbit.com>
Link: https://lore.kernel.org/all/20221010050319.GC2703033@dread.disaster.area/
Change-Id: Iaaa509f5dadcae04f58c929901225bc968b35d52
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Bug: 268174392
(cherry picked from commit 3d5f181bda)
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
commit 3e4cb6ebbb upstream.
I hit a very bad problem during my tests of SENDMSG_ZC.
BUG(); in first_iovec_segment() triggered very easily.
The problem was io_setup_async_msg() in the partial retry case,
which seems to happen more often with _ZC.
iov_iter_iovec_advance() may change i->iov in order to have i->iov_offset
being only relative to the first element.
Which means kmsg->msg.msg_iter.iov is no longer the
same as kmsg->fast_iov.
But this would rewind the copy to be the start of
async_msg->fast_iov, which means the internal
state of sync_msg->msg.msg_iter is inconsistent.
I tested with 5 vectors with length like this 4, 0, 64, 20, 8388608
and got short writes with:
- ret=2675244 min_ret=8388692 => remaining 5713448 sr->done_io=2675244
- ret=-EAGAIN => io_uring_poll_arm
- ret=4911225 min_ret=5713448 => remaining 802223 sr->done_io=7586469
- ret=-EAGAIN => io_uring_poll_arm
- ret=802223 min_ret=802223 => res=8388692
While this was easily triggered with SENDMSG_ZC (queued for 6.1),
it was a potential problem starting with 7ba89d2af1
in 5.18 for IORING_OP_RECVMSG.
And also with 4c3c09439c in 5.19
for IORING_OP_SENDMSG.
However 257e84a537 introduced the critical
code into io_setup_async_msg() in 5.11.
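The pointer problem can be modelled in userspace as follows. This is a
sketch under assumptions, not the kernel code: struct model_msg,
model_advance() and model_setup_async() are hypothetical stand-ins for
the message header, the iterator advance and the async copy. The point
it illustrates is that the copy has to keep the index the iterator had
already reached inside fast_iov instead of rewinding to element 0.

```c
#include <stddef.h>
#include <string.h>

#define MODEL_FAST_IOV 8 /* hypothetical size of the inline iovec array */

struct model_iovec {
	void *base;
	size_t len;
};

struct model_msg {
	struct model_iovec fast_iov[MODEL_FAST_IOV];
	size_t nr_segs;
	const struct model_iovec *iter_iov; /* current segment of the iterator */
	size_t iov_offset;                  /* offset inside *iter_iov */
};

static void model_init(struct model_msg *m, const size_t *lens, size_t n)
{
	size_t i;

	memset(m, 0, sizeof(*m));
	for (i = 0; i < n && i < MODEL_FAST_IOV; i++)
		m->fast_iov[i].len = lens[i];
	m->nr_segs = i;
	m->iter_iov = m->fast_iov;
}

/* Consume bytes; iter_iov moves so iov_offset is relative to one segment. */
static void model_advance(struct model_msg *m, size_t bytes)
{
	const struct model_iovec *end = m->fast_iov + m->nr_segs;

	bytes += m->iov_offset;
	while (m->iter_iov < end && bytes >= m->iter_iov->len) {
		bytes -= m->iter_iov->len;
		m->iter_iov++;
	}
	m->iov_offset = bytes;
}

/* Copy src into dst, then point the iterator into dst's own fast_iov. */
static void model_setup_async(struct model_msg *dst,
			      const struct model_msg *src, int preserve_index)
{
	memcpy(dst, src, sizeof(*dst));
	if (preserve_index) {
		/* Keep the position the iterator had already reached. */
		size_t fast_idx = (size_t)(src->iter_iov - src->fast_iov);

		dst->iter_iov = &dst->fast_iov[fast_idx];
	} else {
		/* Rewinds to element 0 while iov_offset stays as it was. */
		dst->iter_iov = dst->fast_iov;
	}
}
```

With the vector lengths from the test above and a first short write of
2675244 bytes, the iterator sits at index 4. Rewinding to index 0 leaves
iov_offset larger than the 4-byte first segment, which is the
inconsistency described.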
Fixes: 7ba89d2af1 ("io_uring: ensure recv and recvmsg handle MSG_WAITALL correctly")
Fixes: 257e84a537 ("io_uring: refactor sendmsg/recvmsg iov managing")
Cc: stable@vger.kernel.org
Change-Id: I72c459fdbae2938d176126ed2f17eea990c42d49
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/b2e7be246e2fb173520862b0c7098e55767567a2.1664436949.git.metze@samba.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Bug: 268174392
(cherry picked from commit fc2491562a)
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
commit 6f83ab22ad upstream.
-1 tells us to use the current position, but we check if the file is
a stream regardless of that. Fix up io_kiocb_update_pos() to only
dip into file if we need to. This is both more efficient and also drops
12 bytes of text on aarch64 and 64 bytes on x86-64.
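The reordering can be sketched like this. It is a simplified userspace
model, and struct pos_file, struct pos_req and model_update_pos() are
hypothetical names rather than the kernel definitions. The file is only
dereferenced once we know the caller asked for the current position.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, reduced stand-ins for the file and the request. */
struct pos_file {
	bool is_stream;   /* models a stream file, which has no position */
	long long f_pos;  /* current file position */
};

struct pos_req {
	struct pos_file *file;
	long long ki_pos; /* -1 means "use the current position" */
	bool cur_pos;     /* models the "position came from the file" flag */
};

static long long *model_update_pos(struct pos_req *req)
{
	/* Explicit offset: nothing to look up, the file is never touched. */
	if (req->ki_pos != -1)
		return &req->ki_pos;

	/* Only now do we need to know what kind of file this is. */
	if (!req->file->is_stream) {
		req->cur_pos = true;
		req->ki_pos = req->file->f_pos;
		return &req->ki_pos;
	}

	/* Streams have no position. */
	req->ki_pos = 0;
	return NULL;
}
```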
Fixes: b4aec40015 ("io_uring: do not recalculate ppos unnecessarily")
Change-Id: I5c22ce8122b0e1f0ad423a5b3aa520ee416feff1
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Bug: 268174392
(cherry picked from commit 89a77271d2)
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
commit b4aec40015 upstream.
There is a slight optimisation to be had by calculating the correct pos
pointer inside io_kiocb_update_pos and then using that later.
It seems code size drops by a bit:
000000000000a1b0 0000000000000400 t io_read
000000000000a5b0 0000000000000319 t io_write
vs
000000000000a1b0 00000000000003f6 t io_read
000000000000a5b0 0000000000000310 t io_write
Change-Id: I19d8cdb6ea88d8fc4625e521363d5a8f638dfdcb
Signed-off-by: Dylan Yudaken <dylany@fb.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Bug: 268174392
(cherry picked from commit e90cfb9699)
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
commit d34e1e5b39 upstream.
Update kiocb->ki_pos at execution time rather than in io_prep_rw().
io_prep_rw() happens before the job is enqueued to a worker and so the
offset might be read multiple times before being executed once.
This ensures that the file position in a set of _linked_ SQEs will be only
obtained after earlier SQEs have completed, and so will include their
incremented file position.
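A toy model of the difference, assuming (as a simplification) that every
request in the chain is prepared before any of them runs. struct
chain_file and model_run_linked() are hypothetical and only track where
each request would start.

```c
#include <stdbool.h>
#include <stddef.h>

struct chain_file {
	long long f_pos; /* shared file position */
};

/*
 * Run n linked requests that all use "current position".
 * start_pos[i] receives the offset request i ends up using.
 */
static void model_run_linked(struct chain_file *f, const long long *lens,
			     size_t n, bool sample_at_prep,
			     long long *start_pos)
{
	size_t i;

	/* Old behaviour: the position is sampled for every request up front. */
	if (sample_at_prep)
		for (i = 0; i < n; i++)
			start_pos[i] = f->f_pos;

	for (i = 0; i < n; i++) {
		/* New behaviour: sample right before the request executes. */
		if (!sample_at_prep)
			start_pos[i] = f->f_pos;
		f->f_pos = start_pos[i] + lens[i];
	}
}
```

For two linked writes of 10 and 20 bytes starting at position 0,
sampling at prep time makes both start at 0, while sampling at execution
time makes the second start at 10.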
Change-Id: I3c5abbf6a337ec1958fd6600c5feb44fb61a5772
Signed-off-by: Dylan Yudaken <dylany@fb.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Bug: 268174392
(cherry picked from commit ea528ecac3)
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
commit af9c45eceb upstream.
io_kiocb_ppos is called in both branches, and it seems that the compiler
does not fuse this. Fusing removes a few bytes from loop_rw_iter.
Before:
$ nm -S fs/io_uring.o | grep loop_rw_iter
0000000000002430 0000000000000124 t loop_rw_iter
After:
$ nm -S fs/io_uring.o | grep loop_rw_iter
0000000000002430 000000000000010d t loop_rw_iter
Change-Id: Ibd662d59697d9cb1e484319050f6e5f960f6ac5c
Signed-off-by: Dylan Yudaken <dylany@fb.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Bug: 268174392
(cherry picked from commit 076f872314)
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
commit e775f93f2a upstream.
io_uring caches task references to avoid doing atomics for each of them
per request. If a request is put from the same task that allocated it,
then we can maintain a per-ctx cache of them. This obviously relies
on io_uring always pruning caches in a reliable way, and there's
currently a case in io_uring fd release where we can miss that.
One example is a ring setup with IOPOLL, which relies on the task
polling for completions, which will free them. However, if such a task
submits a request and then exits or closes the ring without reaping
the completion, then ring release will reap and put. If release happens
from that very same task, the completed request task refs will get
put back into the cache pool. This is problematic, as we're now beyond
the point of pruning caches.
Manually drop these caches after doing an IOPOLL reap. This releases
references from the current task, which is enough. If another task
happens to be doing the release, then the caching will not be
triggered and there's no issue.
Cc: stable@vger.kernel.org
Fixes: e98e49b2bb ("io_uring: extend task put optimisations")
Reported-by: Homin Rhee <hominlab@gmail.com>
Change-Id: I9495121af065424141fa9c39840ab9aa91f45c72
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Bug: 268174392
(cherry picked from commit e9c6556708)
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
commit 544d163d65 upstream.
syzbot reports an issue with overflow filling for IOPOLL:
WARNING: CPU: 0 PID: 28 at io_uring/io_uring.c:734 io_cqring_event_overflow+0x1c0/0x230 io_uring/io_uring.c:734
CPU: 0 PID: 28 Comm: kworker/u4:1 Not tainted 6.2.0-rc3-syzkaller-16369-g358a161a6a9e #0
Workqueue: events_unbound io_ring_exit_work
Call trace:
io_cqring_event_overflow+0x1c0/0x230 io_uring/io_uring.c:734
io_req_cqe_overflow+0x5c/0x70 io_uring/io_uring.c:773
io_fill_cqe_req io_uring/io_uring.h:168 [inline]
io_do_iopoll+0x474/0x62c io_uring/rw.c:1065
io_iopoll_try_reap_events+0x6c/0x108 io_uring/io_uring.c:1513
io_uring_try_cancel_requests+0x13c/0x258 io_uring/io_uring.c:3056
io_ring_exit_work+0xec/0x390 io_uring/io_uring.c:2869
process_one_work+0x2d8/0x504 kernel/workqueue.c:2289
worker_thread+0x340/0x610 kernel/workqueue.c:2436
kthread+0x12c/0x158 kernel/kthread.c:376
ret_from_fork+0x10/0x20 arch/arm64/kernel/entry.S:863
There is no real problem for normal IOPOLL as flush is also called with
uring_lock taken, but it's getting more complicated for IOPOLL|SQPOLL,
for which __io_cqring_overflow_flush() happens from the CQ waiting path.
Reported-and-tested-by: syzbot+6805087452d72929404e@syzkaller.appspotmail.com
Cc: stable@vger.kernel.org # 5.10+
Change-Id: I3449b2ea1b71ff2f04f119741751b42870386923
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Bug: 268174392
(cherry picked from commit de77faee28)
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
[ Upstream commit 343190841a ]
We only check the register opcode value inside the restricted ring
section, move it into the main io_uring_register() function instead
and check it up front.
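As a rough sketch of the reordering, with hypothetical names
(MODEL_REGISTER_LAST, model_register()) and with the specific error
codes being assumptions rather than taken from this message:

```c
#include <errno.h>
#include <stdbool.h>

#define MODEL_REGISTER_LAST 8u /* hypothetical number of register opcodes */

static int model_register(unsigned int opcode, bool restricted,
			  const bool *allowed)
{
	/* Range check done up front, for every ring. */
	if (opcode >= MODEL_REGISTER_LAST)
		return -EINVAL;

	/* Restricted rings additionally filter on an allow-list. */
	if (restricted && !allowed[opcode])
		return -EACCES;

	return 0;
}
```

Because the range check now runs first, an out-of-range opcode is
rejected on unrestricted rings too, and the allow-list is never indexed
with an invalid value.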
Change-Id: I4b5f782dad48eb0e7f04d5956cc087494e02b2ec
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Bug: 268174392
(cherry picked from commit 78e8151f04)
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
commit a73825ba70 upstream.
Do not set REQ_F_NOWAIT if the socket is non blocking. When enabled this
causes the accept to immediately post a CQE with EAGAIN, which means you
cannot perform an accept SQE on a NONBLOCK socket asynchronously.
By removing the flag if there is no pending accept then poll is armed as
usual and when a connection comes in the CQE is posted.
Change-Id: I0fae3f75c7fbbf44f85da7d83f48c4cfed1fcae9
Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220324143435.2875844-1-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Bug: 268174392
(cherry picked from commit aa4c9b3e45)
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
commit 10c873334f upstream.
We currently check REQ_F_POLLED before arming async poll for a
notification to retry. If it's set, then we don't allow poll and will
punt to io-wq instead. This is done to prevent a situation where a buggy
driver will repeatedly return that there's space/data available yet we
get -EAGAIN.
However, if we already transferred data, then it should be safe to rely
on poll again. Gate the check on whether or not REQ_F_PARTIAL_IO is
also set.
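The gate amounts to a two-flag test, sketched here with hypothetical bit
values (MODEL_F_POLLED and MODEL_F_PARTIAL_IO model REQ_F_POLLED and
REQ_F_PARTIAL_IO; the real flag values differ):

```c
#include <stdbool.h>

#define MODEL_F_POLLED     (1u << 0) /* request was already polled once */
#define MODEL_F_PARTIAL_IO (1u << 1) /* request already transferred data */

/* Returns true if async poll may be armed, false to punt to io-wq. */
static bool model_may_arm_poll(unsigned int flags)
{
	/* Refuse only if we polled before and made no progress since. */
	return (flags & (MODEL_F_POLLED | MODEL_F_PARTIAL_IO)) !=
	       MODEL_F_POLLED;
}
```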
Change-Id: I36b6d16ac43202fdf9ae5eea64f9dfbcfbe7fee5
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Bug: 268174392
(cherry picked from commit 4bc17e6381)
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
commit 4c3c09439c upstream.
Like commit 7ba89d2af1 for recv/recvmsg, support MSG_WAITALL for the
send side. If this flag is set and we do a short send, retry for a
stream or seqpacket socket.
Change-Id: If67a4462576af1b683d53d2dc0d46e44c9dd8863
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Bug: 268174392
(cherry picked from commit f901b4bfd0)
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
commit 8a3e8ee564 upstream.
If we need to continue doing this IO, then we don't want a potentially
selected buffer recycled. Add a flag for that.
Set this for recv/recvmsg if they do partial IO.
Change-Id: If9381bd6a5695c8c85c7a51c3adccc0dc09f8999
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Bug: 268174392
(cherry picked from commit 96ccba4a1a)
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
commit 7ba89d2af1 upstream.
We currently don't attempt to get the full asked for length even if
MSG_WAITALL is set, if we get a partial receive. If we do see a partial
receive, then just note how many bytes we did and return -EAGAIN to
get it retried.
The iov is advanced appropriately for the vector based case, and we
manually bump the buffer and remainder for the non-vector case.
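For the non-vector case, the bookkeeping can be sketched as below. This
is a userspace model with hypothetical names (struct model_sr,
model_recv_done()); only the done_io accounting mirrors what is
described above.

```c
#include <errno.h>
#include <stdbool.h>

struct model_sr {
	char *buf;    /* where the next bytes should land */
	long len;     /* bytes still wanted */
	long done_io; /* bytes received by earlier attempts */
};

/*
 * ret is the result of one receive attempt. Returns -EAGAIN if the
 * request should be retried, otherwise the final result to report.
 */
static long model_recv_done(struct model_sr *sr, long ret, bool waitall)
{
	if (waitall && ret > 0 && ret < sr->len) {
		/* Partial receive: note progress and ask for a retry. */
		sr->buf += ret;
		sr->len -= ret;
		sr->done_io += ret;
		return -EAGAIN;
	}

	if (ret >= 0)
		ret += sr->done_io; /* include earlier partial receives */
	else if (sr->done_io)
		ret = sr->done_io;  /* report what we did get */

	return ret;
}
```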
Cc: stable@vger.kernel.org
Reported-by: Constantine Gavrilov <constantine.gavrilov@gmail.com>
Change-Id: I618bde7c86b29f6053dd8cd19682f2916e57dd54
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Bug: 268174392
(cherry picked from commit aadd9b0930)
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
commit 46a525e199 upstream.
This isn't a reliable mechanism to tell if we have task_work pending, we
really should be looking at whether we have any items queued. This is
problematic if forward progress is gated on running said task_work. One
such example is reading from a pipe, where the write side has been closed
right before the read is started. The fput() of the file queues TWA_RESUME
task_work, and we need that task_work to be run before ->release() is
called for the pipe. If ->release() isn't called, then the read will sit
forever waiting on data that will never arise.
Fix this by changing io_run_task_work() so it checks if we have task_work pending
rather than rely on TIF_NOTIFY_SIGNAL for that. The latter obviously
doesn't work for task_work that is queued without TWA_SIGNAL.
Reported-by: Christiano Haesbaert <haesbaert@haesbaert.org>
Cc: stable@vger.kernel.org
Link: https://github.com/axboe/liburing/issues/665
Change-Id: I042b07491afac06692639d91bdf7dd21a2405651
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Bug: 268174392
(cherry picked from commit 2fd232bbd6)
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
We currently have 3 different ways that __iommu_probe_device() may be
called, but no real guarantee that multiple callers can't tread on each
other, especially once asynchronous driver probe gets involved. It would
likely have taken a fair bit of luck to hit this previously, but commit
57365a04c9 ("iommu: Move bus setup to IOMMU device registration") ups
the odds since now it's not just omap-iommu that may trigger multiple
bus_iommu_probe() calls in parallel if probing asynchronously.
Add a lock to ensure we can't try to double-probe a device, and also
close some possible race windows to make sure we're truly robust against
trying to double-initialise a group via two different member devices.
Reported-by: Brian Norris <briannorris@chromium.org>
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Tested-by: Brian Norris <briannorris@chromium.org>
Fixes: 57365a04c9 ("iommu: Move bus setup to IOMMU device registration")
Link: https://lore.kernel.org/r/1946ef9f774851732eed78760a78ec40dbc6d178.1667591503.git.robin.murphy@arm.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Bug: 269232600
(cherry picked from commit 01657bc14a)
Change-Id: Ie87f8f7a7b90431c3a2682923961885ce7b239f3
Signed-off-by: Zhenhua Huang <quic_zhenhuah@quicinc.com>
commit e6db6f9398 upstream.
We have two types of task_work based creation, one is using an existing
worker to setup a new one (eg when going to sleep and we have no free
workers), and the other is allocating a new worker. Only the latter
should be freed when we cancel task_work creation for a new worker.
Fixes: af82425c6a ("io_uring/io-wq: free worker if task_work creation is canceled")
Reported-by: syzbot+d56ec896af3637bdb7e4@syzkaller.appspotmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Bug: 268174392
(cherry picked from commit a88a0d16e1)
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I75c9b22dce02151b2687cf90d6c5b74c08d0f04b
commit af82425c6a upstream.
If we cancel the task_work, the worker will never come into existence.
As this is the last reference to it, ensure that we get it freed
appropriately.
Cc: stable@vger.kernel.org
Reported-by: 진호 <wnwlsgh98@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Bug: 268174392
(cherry picked from commit b912ed1363)
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: Iacfd7a5db15c417fd1f02c85e414e3137e8729ec
signal.c can't use the heap for the big data it keeps on the stack. However,
by default a compiler warns us about overstepping stack frame size
threshold:
arch/um/os-Linux/signal.c: In function ‘sig_handler_common’:
arch/um/os-Linux/signal.c:51:1: warning: the frame size of 2960 bytes is larger than 2048 bytes [-Wframe-larger-than=]
51 | }
| ^
arch/um/os-Linux/signal.c: In function ‘timer_real_alarm_handler’:
arch/um/os-Linux/signal.c:95:1: warning: the frame size of 2960 bytes is larger than 2048 bytes [-Wframe-larger-than=]
95 | }
| ^
Because of the above, increase the stack frame size threshold explicitly
for signal.c to avoid the unnecessary warning.
Bug: 269057599
Change-Id: Ib7474bddfefa97f9c60087db6a607a111e4d23bc
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Tested-by: David Gow <davidgow@google.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
(cherry picked from commit 517f60206e)
Signed-off-by: Srinivasarao Pathipati <quic_spathi@quicinc.com>
Add a vendor hook to arch_setup_dma_ops to allow vendors to perform
any necessary post-actions on setting up DMA ops for a given device,
mainly so that they can opt in to the workaround for Cortex-A510
erratum 2454944.
Bug: 263236925
Change-Id: I6fd4d3a30829437fc113ec15ca2e5d060a38e60c
Signed-off-by: Beata Michalska <beata.michalska@arm.com>
Cortex-A510 erratum 2454944 may cause clean cache lines to be
erroneously written back to memory, breaking the assumptions we rely on
for non-coherent DMA. Try to mitigate this by implementing special DMA
ops that do their best to avoid cacheable aliases via a combination of
bounce-buffering and manipulating the linear map directly, to minimise
the chance of DMA-mapped pages being speculated back into caches.
The other main concern is initial entry, where cache lines covering the
kernel image might potentially become affected between being cleaned by
the bootloader and the kernel being called, which might require additional
cache maintenance from the bootloader to be safe in that regard too.
Cortex-A510 supports S2FWB, so KVM should be unaffected.
For the workaround to be applied, it needs to be explicitly requested
through the dedicated arm64_noalias_setup_dma_ops callback.
Bug: 223346425
(cherry picked from commit 683efc5fc6eeb653caf85c33a2fb92a33c8faa75
https://git.gitlab.arm.com/linux-arm/linux-rm.git arm64/2454944-dev)
Change-Id: If76b97dc39c278edb80f9b750129975ab2ac563e
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
[BM: Stripping-down the original solution by removing support for
cpu capabilities and amending relevant bits, with the final
version being reduced to dedicated DMA ops with dependencies on
rodata_full being enabled (CONFIG_RODATA_FULL_DEFAULT_ENABLED),
swiotlb late init and disabling lazy tlb flushing.
Also, as a consequence, reducing debugging support.]
Signed-off-by: Beata Michalska <beata.michalska@arm.com>
Add an interface to disable lazy vunmap by forcing the threshold
to zero. This might be interesting for debugging/testing in general,
but primarily helps a horrible situation which needs to guarantee
that vmalloc aliases are up-to-date from atomic context, wherein
the only practical solution is to never let them get stale in
the first place.
Bug: 223346425
(cherry picked from commit 2a34c1503b85f49dd472dfd932dfcd16cab8ee8a
https://git.gitlab.arm.com/linux-arm/linux-rm.git arm64/2454944-dev)
Change-Id: I12fbbe3903f76a028ceea91ed078f0de2abe3815
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
[BM: Convert to a flag that can be explicitly modified at runtime
instead of relying on arch specific bits]
Signed-off-by: Beata Michalska <beata.michalska@arm.com>
In the 5.10.162 release, the io_uring code was synced with the version
that is in the 5.15.y kernel tree in order to resolve a huge number of
potential, and known, problems with the codebase. This makes for a more
secure and easier-to-update-and-maintain 5.10.y kernel tree, so this is
a great thing, however this caused some issues when it comes to the
Android KABI preservation and checking tools.
A number of the io_uring structures get used in other core kernel
structures, only as "opaque" pointers, so there is not any real ABI
breakage. But, due to the visibility of the structures going away, the
CRC values of many scheduler variables and functions were changed.
In order to preserve the CRC values, and to prevent all device kernels from
being forced to rebuild for no reason whatsoever from a functional point of
view, we need to keep around the "old" io_uring structures for the CRC
calculation only. This is done by the following definitions of struct
io_identity and struct io_uring_task which will only be visible when the
CRC calculation build happens, not in any functional kernel build.
Yes, this all is a horrible hack, and these really are not the true
structures that any code uses, but such is life in the world of stable
APIs.
Bug: 161946584
Bug: 268174392
Fixes: 788d082426 ("io_uring: import 5.15-stable io_uring")
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I2294f220ae78fe9aa32ee25b81829ae765e9deb2
In commit 788d082426 ("io_uring: import 5.15-stable io_uring"), a new
field was added to struct task_struct. Move it to the proper location
and macro in order to preserve the kernel ABI.
Bug: 161946584
Bug: 268174392
Fixes: 788d082426 ("io_uring: import 5.15-stable io_uring")
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: Ib2f65b7c1a973794b7ab525a9304f666ffebc9ee
In commit a3025359ff ("net: remove cmsg restriction from io_uring
based send/recvmsg calls") the flags variable was removed from struct
proto_ops as it is no longer needed.
But the ABI signatures break, so put it back to preserve this, there's
no functional change here.
Bug: 161946584
Bug: 268174392
Fixes: a3025359ff ("net: remove cmsg restriction from io_uring based send/recvmsg calls")
Change-Id: Ic6a868f038701a61c993e18b44cdd8ec8b0a4d58
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
[ Upstream commit 4464853277 ]
Pass in EPOLL_URING_WAKE when signaling eventfd or doing poll related
wakeups, so that we can check for a circular event dependency between
eventfd and epoll. If this flag is set when our wakeup handlers are
called, then we know we have a dependency that needs to terminate
multishot requests.
eventfd and epoll are the only such possible dependencies.
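A simplified model of the idea. The names and bit values below
(MODEL_POLLIN, MODEL_ONESHOT, MODEL_URING_WAKE, model_poll_wake()) are
hypothetical, and downgrading the request to one-shot is shown as one
plausible way to terminate multishot, not necessarily what the real
handler does.

```c
#include <stdbool.h>

#define MODEL_POLLIN     0x0001u
#define MODEL_URING_WAKE (1u << 27) /* models EPOLL_URING_WAKE */
#define MODEL_ONESHOT    (1u << 30) /* models a one-shot poll request */

struct model_poll {
	unsigned int events; /* events the request is waiting for */
};

/* Wakeup key used by the signaller: io_uring tags its own wakeups. */
static unsigned int model_signal_key(bool from_io_uring)
{
	return MODEL_POLLIN | (from_io_uring ? MODEL_URING_WAKE : 0u);
}

/* Wakeup handler: returns true if the request should run. */
static bool model_poll_wake(struct model_poll *poll, unsigned int key)
{
	/* Ignore wakeups for events we did not ask for. */
	if (key && !(key & poll->events & ~MODEL_ONESHOT))
		return false;

	/* Triggered off our own wakeup path: stop re-arming. */
	if (key & MODEL_URING_WAKE)
		poll->events |= MODEL_ONESHOT;

	return true;
}
```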
Bug: 268174392
Cc: stable@vger.kernel.org # 6.0
Change-Id: I6e45fa1484657bd5caad007783785c2ee97a9929
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 189556b05e)
Bug: 268174392
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
[ Upstream commit 03e02acda8 ]
This is identical to eventfd_signal(), but it allows the caller to pass
in a mask to be used for the poll wakeup key. The use case is avoiding
repeated multishot triggers if we have a dependency between eventfd and
io_uring.
If we setup an eventfd context and register that as the io_uring eventfd,
and at the same time queue a multishot poll request for the eventfd
context, then any CQE posted will repeatedly trigger the multishot request
until it terminates when the CQ ring overflows.
In preparation for io_uring detecting this circular dependency, add the
mentioned helper so that io_uring can pass in EPOLL_URING_WAKE as part of the
poll wakeup key.
Cc: stable@vger.kernel.org # 6.0
[axboe: fold in !CONFIG_EVENTFD fix from Zhang Qilong]
Change-Id: I0c38a56887777f85cb10673b7ca3b5ca4d70c61b
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 4ef66581d7)
Bug: 268174392
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
[ Upstream commit caf1aeaffc ]
We can have dependencies between epoll and io_uring. Consider an epoll
context, identified by the epfd file descriptor, and an io_uring file
descriptor identified by iofd. If we add iofd to the epfd context, and
arm a multishot poll request for epfd with iofd, then the multishot
poll request will repeatedly trigger and generate events until terminated
by CQ ring overflow. This isn't a desired behavior.
Add EPOLL_URING_WAKE so that io_uring can pass it in as part of the poll wakeup
key, and io_uring can check for that to detect a potential recursive
invocation.
Cc: stable@vger.kernel.org # 6.0
Change-Id: Ifafcb236b2cfe3ca3e7254a0155625fce00fd038
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 2f09377502)
Bug: 268174392
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
[ Upstream commit e54937963f ]
No need to restrict these anymore, as the worker threads are direct
clones of the original task. Hence we know for a fact that we can
support anything that the regular task can.
Since the only user of proto_ops->flags was to flag PROTO_CMSG_DATA_ONLY,
kill the member and the flag definition too.
Change-Id: Ie87e4ff3c621cf53a8e9589a7689e62d759de983
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit a3025359ff)
Bug: 268174392
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
[ Upstream commit 35d0b389f3 ]
Song reported a boot regression in a kvm image with 5.11-rc, and bisected
it down to the below patch. Debugging this issue, it turns out that the boot
stalled while a task was waiting on a pipe being released. As we no longer
run task_work from get_signal() unless it's queued with TWA_SIGNAL, the
task goes idle without running the task_work. This prevents ->release()
from being called on the pipe, which another boot task is waiting on.
For now, re-instate the unconditional task_work run from get_signal().
For 5.12, we'll collapse TWA_RESUME and TWA_SIGNAL, as it no longer
makes sense to have a distinction between the two. This will turn
task_work notification into a simple boolean, whether to notify or not.
Fixes: 98b89b649f ("signal: kill JOBCTL_TASK_WORK")
Reported-by: Song Liu <songliubraving@fb.com>
Tested-by: John Stultz <john.stultz@linaro.org>
Tested-by: Douglas Anderson <dianders@chromium.org>
Tested-by: Sedat Dilek <sedat.dilek@gmail.com> # LLVM/Clang version 11.0.1
Change-Id: Id5ce292120cafff9ede9bb7421cde3aaf4e56924
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 6ef2b4728a)
Bug: 268174392
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
No upstream commit exists.
This imports the io_uring codebase from 5.15.85, wholesale. Changes
from that code base:
- Drop IOCB_ALLOC_CACHE, we don't have that in 5.10.
- Drop MKDIRAT/SYMLINKAT/LINKAT. Would require further VFS backports,
and we don't support these in 5.10 to begin with.
- sock_from_file() old style calling convention.
- Use compat_get_bitmap() only for CONFIG_COMPAT=y
Change-Id: I7ce5226d6b39763ffc246fd6357cece9aafd4b59
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 788d082426)
Bug: 268174392
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
[ Upstream commit c7aab1a7c5 ]
The only exported helper we have right now is task_work_cancel(), which
cancels any task_work from a given task where func matches the queued
work item. This is a bit too coarse for some use cases. Add a
task_work_cancel_match() that allows more specific targeting of
individual work items, beyond matching purely on the callback function used.
task_work_cancel() can be trivially implemented on top of that, hence do
so.
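The layering can be sketched on a plain singly linked list. This is a
userspace model with hypothetical names (struct model_work,
model_cancel_match(), model_cancel()); the real helpers operate on a
task's pending work list and need synchronisation that is left out here.

```c
#include <stdbool.h>
#include <stddef.h>

struct model_work;
typedef void (*model_work_fn)(struct model_work *);

struct model_work {
	struct model_work *next;
	model_work_fn func;
};

typedef bool (*model_match_fn)(struct model_work *, void *);

/* Unlink and return the first queued item for which match() is true. */
static struct model_work *model_cancel_match(struct model_work **head,
					     model_match_fn match, void *data)
{
	struct model_work **pprev = head;
	struct model_work *work;

	while ((work = *pprev) != NULL) {
		if (match(work, data)) {
			*pprev = work->next;
			work->next = NULL;
			return work;
		}
		pprev = &work->next;
	}
	return NULL;
}

/* Matcher comparing only the callback; data points at a model_work_fn. */
static bool model_func_match(struct model_work *work, void *data)
{
	return work->func == *(model_work_fn *)data;
}

/* The coarse helper, implemented on top of the match-based one. */
static struct model_work *model_cancel(struct model_work **head,
				       model_work_fn func)
{
	return model_cancel_match(head, model_func_match, &func);
}

/* A finer-grained matcher: target one specific queued item. */
static bool model_ptr_match(struct model_work *work, void *data)
{
	return work == data;
}

/* Sample callbacks, used only to tell work items apart. */
static void model_cb_a(struct model_work *work) { (void)work; }
static void model_cb_b(struct model_work *work) { (void)work; }
```

With two items queued that share model_cb_a, model_cancel() can only
remove whichever comes first, while model_cancel_match() with
model_ptr_match can pick the second one directly.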
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Change-Id: Ia33480d209b26d433a3ca196972d6931aa4f8dde
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit ed30050329)
Bug: 268174392
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
[ Upstream commit 10442994ba ]
Right now we're never calling get_signal() from PF_IO_WORKER threads, but
in preparation for doing so, don't handle a fatal signal for them. The
workers have state they need to cleanup when exiting, so just return
instead of calling do_exit() on their behalf. The threads themselves will
detect a fatal signal and do proper shutdown.
Change-Id: Iedc3fae8cb496d003852c87fdefacc1ad7601cc5
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 831cb78a2a)
Bug: 268174392
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>