ovl_inode_lock() is interruptible. When inode_lock() in ovl_llseek()
was replaced with ovl_inode_lock(), we did not add a check for error.
Fix this by making ovl_inode_lock() uninterruptible and change the
existing call sites to use an _interruptible variant.
Reported-by: syzbot+66a9752fa927f745385e@syzkaller.appspotmail.com
Fixes: b1f9d3858f ("ovl: use ovl_inode_lock in ovl_llseek()")
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Since snprintf() returns the would-be-output size instead of the
actual output size, the succeeding calls may go beyond the given
buffer limit. Fix it by replacing with scnprintf().
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
In xchk_xattr_listent, we attempt to validate the extended attribute
hash structures by performing a attr lookup by (hashed) name. If the
lookup returns ENODATA, that means that the hash information is corrupt.
The _process_error functions don't catch this, so we have to add that
explicitly.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
In xchk_dir_actor, we attempt to validate the directory hash structures
by performing a directory entry lookup by (hashed) name. If the lookup
returns ENOENT, that means that the hash information is corrupt. The
_process_error functions don't catch this, so we have to add that
explicitly.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Check the owner field of dir3 block headers. If it's corrupt, release
the buffer and return EFSCORRUPTED. All callers handle this properly.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Check the owner field of dir3 data block headers. If it's corrupt,
release the buffer and return EFSCORRUPTED. All callers handle this
properly.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
If we decide that a directory free block is corrupt, we must take care
not to leak a buffer pointer to the caller. After xfs_trans_brelse
returns, the buffer can be freed or reused, which means that we have to
set *bpp back to NULL.
Callers are supposed to notice the nonzero return value and not use the
buffer pointer, but we should code more defensively, even if all current
callers handle this situation correctly.
Fixes: de14c5f541 ("xfs: verify free block header fields")
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
xfs_verifier_error is supposed to be called on a corrupt metadata buffer
from within a buffer verifier function, whereas xfs_buf_mark_corrupt
is the function to be called when a piece of code has read a buffer and
catches something that a read verifier cannot. The first function sets
b_error anticipating that the low level buffer handling code will see
the nonzero b_error and clear XBF_DONE on the buffer, whereas the second
function does not.
Since xfs_dir3_free_header_check examines fields in the dir free block
header that require more context than can be provided to read verifiers,
we must call xfs_buf_mark_corrupt when it finds a problem.
Switching the calls has a secondary effect that we no longer corrupt the
buffer state by setting b_error and leaving XBF_DONE set. When /that/
happens, we'll trip over various state assertions (most commonly the
b_error check in xfs_buf_reverify) on a subsequent attempt to read the
buffer.
Fixes: bc1a09b8e3 ("xfs: refactor verifier callers to print address of failing check")
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Add a xfs_failaddr_t parameter to this function so that callers can
potentially pass in (and therefore report) the exact point in the code
where we decided that a metadata buffer was corrupt. This enables us to
wire it up to checking functions that have to run outside of verifiers.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Add a helper function to get rid of buffers that we have decided are
corrupt after the verifiers have run. This function is intended to
handle metadata checks that can't happen in the verifiers, such as
inter-block relationship checking. Note that we now mark the buffer
stale so that it will not end up on any LRU and will be purged on
release.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
In e7ee96dfb8, we converted all ITER_ABORT users to use ECANCELED
instead, but we forgot to teach xfs_rmap_has_other_keys not to return
that magic value to callers. Fix it now by using ECANCELED both to
abort the iteration and to signal that we found another reverse mapping.
This enables us to drop the separate boolean flag.
Fixes: e7ee96dfb8 ("xfs: remove all *_ITER_ABORT values")
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Log the corrupt buffer before we release the buffer.
Fixes: a5155b870d ("xfs: always log corruption errors")
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Deduplicate cancellation parts, as many of them looks the same, as do
e.g.
- io_wqe_cancel_cb_work() and io_wqe_cancel_work()
- io_wq_worker_cancel() and io_work_cancel()
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pull fscrypt fix from Eric Biggers:
"Fix a bug where if userspace is writing to encrypted files while the
FS_IOC_REMOVE_ENCRYPTION_KEY ioctl (introduced in v5.4) is running,
dirty inodes could be evicted, causing writes could be lost or the
filesystem to hang due to a use-after-free. This was encountered
during real-world use, not just theoretical.
Tested with the existing fscrypt xfstests, and with a new xfstest I
wrote to reproduce this bug. This fix does expose an existing bug with
'-o lazytime' that Ted is working on fixing, but this fix is more
critical and needed anyway regardless of the lazytime fix"
* tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt:
fscrypt: don't evict dirty inodes after removing key
Ensure we keep the truncated value, if we did truncate it. If not, we
might read/write more than the registered buffer size.
Also for retry, ensure that we return the truncated mapped value for
the vectorized versions of the read/write commands.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
struct xfs_agfl is a header in front of the AGFL entries that exists
for CRC enabled file systems. For not CRC enabled file systems the AGFL
is simply a list of agbno. Make the CRC case similar to that by just
using the list behind the new header. This indirectly solves a problem
with modern gcc versions that warn about taking addresses of packed
structures (and we have to pack the AGFL given that gcc rounds up
structure sizes). Also replace the helper macro to get from a buffer
with an inline function in xfs_alloc.h to make the code easier to
read.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Leaving PF_MEMALLOC set when exiting a kthread causes it to remain set
during do_exit(). That can confuse things. In particular, if BSD
process accounting is enabled, then do_exit() writes data to an
accounting file. If that file has FS_SYNC_FL set, then this write
occurs synchronously and can misbehave if PF_MEMALLOC is set.
For example, if the accounting file is located on an XFS filesystem,
then a WARN_ON_ONCE() in iomap_do_writepage() is triggered and the data
doesn't get written when it should. Or if the accounting file is
located on an ext4 filesystem without a journal, then a WARN_ON_ONCE()
in ext4_write_inode() is triggered and the inode doesn't get written.
Fix this in xfsaild() by using the helper functions to save and restore
PF_MEMALLOC.
This can be reproduced as follows in the kvm-xfstests test appliance
modified to add the 'acct' Debian package, and with kvm-xfstests's
recommended kconfig modified to add CONFIG_BSD_PROCESS_ACCT=y:
mkfs.xfs -f /dev/vdb
mount /vdb
touch /vdb/file
chattr +S /vdb/file
accton /vdb/file
mkfs.xfs -f /dev/vdc
mount /vdc
umount /vdc
It causes:
WARNING: CPU: 1 PID: 336 at fs/iomap/buffered-io.c:1534
CPU: 1 PID: 336 Comm: xfsaild/vdc Not tainted 5.6.0-rc5 #3
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20191223_100556-anatol 04/01/2014
RIP: 0010:iomap_do_writepage+0x16b/0x1f0 fs/iomap/buffered-io.c:1534
[...]
Call Trace:
write_cache_pages+0x189/0x4d0 mm/page-writeback.c:2238
iomap_writepages+0x1c/0x33 fs/iomap/buffered-io.c:1642
xfs_vm_writepages+0x65/0x90 fs/xfs/xfs_aops.c:578
do_writepages+0x41/0xe0 mm/page-writeback.c:2344
__filemap_fdatawrite_range+0xd2/0x120 mm/filemap.c:421
file_write_and_wait_range+0x71/0xc0 mm/filemap.c:760
xfs_file_fsync+0x7a/0x2b0 fs/xfs/xfs_file.c:114
generic_write_sync include/linux/fs.h:2867 [inline]
xfs_file_buffered_aio_write+0x379/0x3b0 fs/xfs/xfs_file.c:691
call_write_iter include/linux/fs.h:1901 [inline]
new_sync_write+0x130/0x1d0 fs/read_write.c:483
__kernel_write+0x54/0xe0 fs/read_write.c:515
do_acct_process+0x122/0x170 kernel/acct.c:522
slow_acct_process kernel/acct.c:581 [inline]
acct_process+0x1d4/0x27c kernel/acct.c:607
do_exit+0x83d/0xbc0 kernel/exit.c:791
kthread+0xf1/0x140 kernel/kthread.c:257
ret_from_fork+0x27/0x50 arch/x86/entry/entry_64.S:352
This bug was originally reported by syzbot at
https://lore.kernel.org/r/0000000000000e7156059f751d7b@google.com.
Reported-by: syzbot+1f9dc49e8de2582d90c2@syzkaller.appspotmail.com
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
If regular inode has no compressed cluster, allow using 'chattr -c'
to remove its compress flag, recovering it to a non-compressed file.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Compressed cluster can be generated during dirty data writeback,
if there is dirty pages on compressed inode, it needs to disable
converting compressed inode to non-compressed one.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
stat_inc_compr_inode() needs to check FI_COMPRESSED_FILE flag, so
in f2fs_disable_compressed_file(), we should call stat_dec_compr_inode()
before clearing FI_COMPRESSED_FILE flag.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
When SETUP_IOPOLL and SETUP_SQPOLL are both enabled, applications don't need
to do io completion events polling again, they can rely on io_sq_thread to do
polling work, which can reduce cpu usage and uring_lock contention.
I modify fio io_uring engine codes a bit to evaluate the performance:
static int fio_ioring_getevents(struct thread_data *td, unsigned int min,
continue;
}
- if (!o->sqpoll_thread) {
+ if (o->sqpoll_thread && o->hipri) {
r = io_uring_enter(ld, 0, actual_min,
IORING_ENTER_GETEVENTS);
if (r < 0) {
and use "fio -name=fiotest -filename=/dev/nvme0n1 -iodepth=$depth -thread
-rw=read -ioengine=io_uring -hipri=1 -sqthread_poll=1 -direct=1 -bs=4k
-size=10G -numjobs=1 -time_based -runtime=120"
original codes
--------------------------------------------------------------------
iodepth | 4 | 8 | 16 | 32 | 64
bw | 1133MB/s | 1519MB/s | 2090MB/s | 2710MB/s | 3012MB/s
fio cpu usage | 100% | 100% | 100% | 100% | 100%
--------------------------------------------------------------------
with patch
--------------------------------------------------------------------
iodepth | 4 | 8 | 16 | 32 | 64
bw | 1196MB/s | 1721MB/s | 2351MB/s | 2977MB/s | 3357MB/s
fio cpu usage | 63.8% | 74.4%% | 81.1% | 83.7% | 82.4%
--------------------------------------------------------------------
bw improve | 5.5% | 13.2% | 12.3% | 9.8% | 11.5%
--------------------------------------------------------------------
From above test results, we can see that bw has above 5.5%~13%
improvement, and fio process's cpu usage also drops much. Note this
won't improve io_sq_thread's cpu usage when SETUP_IOPOLL|SETUP_SQPOLL
are both enabled, in this case, io_sq_thread always has 100% cpu usage.
I think this patch will be friendly to applications which will often use
io_uring_wait_cqe() or similar from liburing.
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In f2fs_vm_page_mkwrite(), if inode is compress one, and current mmapped
page locates in compressed cluster, we have to call f2fs_get_dnode_of_data()
to get its physical block address before f2fs_wait_on_block_writeback().
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Lack of maintenance on comments may mislead developers, fix them.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
f2fs_inode.xattr_ver field was gone after commit d260081ccf
("f2fs: change recovery policy of xattr node block"), remove i_sem
lock coverage in f2fs_setxattr()
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This change solves below hangtask issue:
INFO: task kworker/u16:1:58 blocked for more than 122 seconds.
Not tainted 5.6.0-rc2-00590-g9983bdae4974e #11
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kworker/u16:1 D 0 58 2 0x00000000
Workqueue: writeback wb_workfn (flush-179:0)
Backtrace:
(__schedule) from [<c0913234>] (schedule+0x78/0xf4)
(schedule) from [<c017ec74>] (rwsem_down_write_slowpath+0x24c/0x4c0)
(rwsem_down_write_slowpath) from [<c0915f2c>] (down_write+0x6c/0x70)
(down_write) from [<c0435b80>] (f2fs_write_single_data_page+0x608/0x7ac)
(f2fs_write_single_data_page) from [<c0435fd8>] (f2fs_write_cache_pages+0x2b4/0x7c4)
(f2fs_write_cache_pages) from [<c043682c>] (f2fs_write_data_pages+0x344/0x35c)
(f2fs_write_data_pages) from [<c0267ee8>] (do_writepages+0x3c/0xd4)
(do_writepages) from [<c0310cbc>] (__writeback_single_inode+0x44/0x454)
(__writeback_single_inode) from [<c03112d0>] (writeback_sb_inodes+0x204/0x4b0)
(writeback_sb_inodes) from [<c03115cc>] (__writeback_inodes_wb+0x50/0xe4)
(__writeback_inodes_wb) from [<c03118f4>] (wb_writeback+0x294/0x338)
(wb_writeback) from [<c0312dac>] (wb_workfn+0x35c/0x54c)
(wb_workfn) from [<c014f2b8>] (process_one_work+0x214/0x544)
(process_one_work) from [<c014f634>] (worker_thread+0x4c/0x574)
(worker_thread) from [<c01564fc>] (kthread+0x144/0x170)
(kthread) from [<c01010e8>] (ret_from_fork+0x14/0x2c)
Reported-and-tested-by: Ondřej Jirman <megi@xff.cz>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
inode.i_blocks counts based on 512byte sector, we need to convert
to 4kb sized block count before comparing to i_compr_blocks.
In addition, add to print message when sanity check on inode
compression configs failed.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If CONFIG_NET is not set, gcc warns:
fs/io_uring.c:3110:12: warning: io_setup_async_msg defined but not used [-Wunused-function]
static int io_setup_async_msg(struct io_kiocb *req,
^~~~~~~~~~~~~~~~~~
There are many funcions wraped by CONFIG_NET, move them
together to simplify code, also fix this warning.
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Minor tweaks.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Not easy to tell if we're going over the size of bits we can shove
in req->flags, so add an end-of-bits marker and a BUILD_BUG_ON()
check for it.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We have IORING_OP_PROVIDE_BUFFERS, but the only way to remove buffers
is to trigger IO on them. The usual case of shrinking a buffer pool
would be to just not replenish the buffers when IO completes, and
instead just free it. But it may be nice to have a way to manually
remove a number of buffers from a given group, and
IORING_OP_REMOVE_BUFFERS provides that functionality.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This adds support for the vectored read. This is limited to supporting
just 1 segment in the iov, and is provided just for convenience for
applications that use IORING_OP_READV already.
The iov helpers will be used for IORING_OP_RECVMSG as well.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If a server process has tons of pending socket connections, generally
it uses epoll to wait for activity. When the socket is ready for reading
(or writing), the task can select a buffer and issue a recv/send on the
given fd.
Now that we have fast (non-async thread) support, a task can have tons
of pending reads or writes pending. But that means they need buffers to
back that data, and if the number of connections is high enough, having
them preallocated for all possible connections is unfeasible.
With IORING_OP_PROVIDE_BUFFERS, an application can register buffers to
use for any request. The request then sets IOSQE_BUFFER_SELECT in the
sqe, and a given group ID in sqe->buf_group. When the fd becomes ready,
a free buffer from the specified group is selected. If none are
available, the request is terminated with -ENOBUFS. If successful, the
CQE on completion will contain the buffer ID chosen in the cqe->flags
member, encoded as:
(buffer_id << IORING_CQE_BUFFER_SHIFT) | IORING_CQE_F_BUFFER;
Once a buffer has been consumed by a request, it is no longer available
and must be registered again with IORING_OP_PROVIDE_BUFFERS.
Requests need to support this feature. For now, IORING_OP_READ and
IORING_OP_RECV support it. This is checked on SQE submission, a CQE with
res == -EOPNOTSUPP will be posted if attempted on unsupported requests.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
IORING_OP_PROVIDE_BUFFERS uses the buffer registration infrastructure to
support passing in an addr/len that is associated with a buffer ID and
buffer group ID. The group ID is used to index and lookup the buffers,
while the buffer ID can be used to notify the application which buffer
in the group was used. The addr passed in is the starting buffer address,
and length is each buffer length. A number of buffers to add with can be
specified, in which case addr is incremented by length for each addition,
and each buffer increments the buffer ID specified.
No validation is done of the buffer ID. If the application provides
buffers within the same group with identical buffer IDs, then it'll have
a hard time telling which buffer ID was used. The only restriction is
that the buffer ID can be a max of 16-bits in size, so USHRT_MAX is the
maximum ID that can be used.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The current codebase makes use of the zero-length array language
extension to the C90 standard, but the preferred mechanism to declare
variable-length types such as these ones is a flexible array member[1][2],
introduced in C99:
struct foo {
int stuff;
struct boo array[];
};
By making use of the mechanism above, we will get a compiler warning
in case the flexible array does not occur last in the structure, which
will help us prevent some kind of undefined behavior bugs from being
inadvertently introduced[3] to the codebase from now on.
Also, notice that, dynamic memory allocations won't be affected by
this change:
"Flexible array members have incomplete type, and so the sizeof operator
may not be applied. As a quirk of the original implementation of
zero-length arrays, sizeof evaluates to zero."[1]
This issue was found with the help of Coccinelle.
[1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
[2] https://github.com/KSPP/linux/issues/21
[3] commit 7649773293 ("cxgb3/l2t: Fix undefined behaviour")
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Link: https://lore.kernel.org/r/20200309202327.GA8813@embeddedor
Signed-off-by: Kees Cook <keescook@chromium.org>
The current codebase makes use of the zero-length array language
extension to the C90 standard, but the preferred mechanism to declare
variable-length types such as these ones is a flexible array member[1][2],
introduced in C99:
struct foo {
int stuff;
struct boo array[];
};
By making use of the mechanism above, we will get a compiler warning
in case the flexible array does not occur last in the structure, which
will help us prevent some kind of undefined behavior bugs from being
inadvertently introduced[3] to the codebase from now on.
Also, notice that, dynamic memory allocations won't be affected by
this change:
"Flexible array members have incomplete type, and so the sizeof operator
may not be applied. As a quirk of the original implementation of
zero-length arrays, sizeof evaluates to zero."[1]
This issue was found with the help of Coccinelle.
[1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
[2] https://github.com/KSPP/linux/issues/21
[3] commit 7649773293 ("cxgb3/l2t: Fix undefined behaviour")
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
After more careful studying, Paul informs me that we cannot rely on
ordering of RCU callbacks in the way that the the tagged commit did.
The current construct looks like this:
void C(struct rcu_head *rhp)
{
do_something(rhp);
call_rcu(&p->rh, B);
}
call_rcu(&p->rh, A);
call_rcu(&p->rh, C);
and we're relying on ordering between A and B, which isn't guaranteed.
Make this explicit instead, and have a work item issue the rcu_barrier()
to ensure that A has run before we manually execute B.
While thorough testing never showed this issue, it's dependent on the
per-cpu load in terms of RCU callbacks. The updated method simplifies
the code as well, and eliminates the need to maintain an rcu_head in
the fileset data.
Fixes: c1e2148f8e ("io_uring: free fixed_file_data after RCU grace period")
Reported-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pull driver core and debugfs fixes from Greg KH:
"Here are four small driver core / debugfs patches for 5.6-rc3:
- debugfs api cleanup now that all debugfs_create_regset32() callers
have been fixed up. This was waiting until after the -rc1 merge as
these fixes came in through different trees
- driver core sync state fixes based on reports of minor issues found
in the feature
All of these have been in linux-next with no reported issues"
* tag 'driver-core-5.6-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
driver core: Skip unnecessary work when device doesn't have sync_state()
driver core: Add dev_has_sync_state()
driver core: Call sync_state() even if supplier has no consumers
debugfs: remove return value of debugfs_create_regset32()
After FS_IOC_REMOVE_ENCRYPTION_KEY removes a key, it syncs the
filesystem and tries to get and put all inodes that were unlocked by the
key so that unused inodes get evicted via fscrypt_drop_inode().
Normally, the inodes are all clean due to the sync.
However, after the filesystem is sync'ed, userspace can modify and close
one of the files. (Userspace is *supposed* to close the files before
removing the key. But it doesn't always happen, and the kernel can't
assume it.) This causes the inode to be dirtied and have i_count == 0.
Then, fscrypt_drop_inode() failed to consider this case and indicated
that the inode can be dropped, causing the write to be lost.
On f2fs, other problems such as a filesystem freeze could occur due to
the inode being freed while still on f2fs's dirty inode list.
Fix this bug by making fscrypt_drop_inode() only drop clean inodes.
I've written an xfstest which detects this bug on ext4, f2fs, and ubifs.
Fixes: b1c0ec3599 ("fscrypt: add FS_IOC_REMOVE_ENCRYPTION_KEY ioctl")
Cc: <stable@vger.kernel.org> # v5.4+
Link: https://lore.kernel.org/r/20200305084138.653498-1-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
Pull io_uring fixes from Jens Axboe:
"Here are a few io_uring fixes that should go into this release. This
contains:
- Removal of (now) unused io_wq_flush() and associated flag (Pavel)
- Fix cancelation lockup with linked timeouts (Pavel)
- Fix for potential use-after-free when freeing percpu ref for fixed
file sets
- io-wq cancelation fixups (Pavel)"
* tag 'io_uring-5.6-2020-03-07' of git://git.kernel.dk/linux-block:
io_uring: fix lockup with timeouts
io_uring: free fixed_file_data after RCU grace period
io-wq: remove io_wq_flush and IO_WQ_WORK_INTERNAL
io-wq: fix IO_WQ_WORK_NO_CANCEL cancellation
There is a recipe to deadlock the kernel: submit a timeout sqe with a
linked_timeout (e.g. test_single_link_timeout_ception() from liburing),
and SIGKILL the process.
Then, io_kill_timeouts() takes @ctx->completion_lock, but the timeout
isn't flagged with REQ_F_COMP_LOCKED, and will try to double grab it
during io_put_free() to cancel the linked timeout. Probably, the same
can happen with another io_kill_timeout() call site, that is
io_commit_cqring().
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pull btrfs fix from David Sterba:
"One fixup for DIO when in use with the new checksums, a missed case
where the checksum size was still assuming u32"
* tag 'for-5.6-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: fix RAID direct I/O reads with alternate csums