Commit Graph

65089 Commits

Author SHA1 Message Date
Pavel Begunkov
e85530ddda io-wq: remove unused IO_WQ_WORK_HAS_MM
IO_WQ_WORK_HAS_MM is set but never used, remove it.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-02 14:05:24 -07:00
Pavel Begunkov
02d27d8953 io_uring: extract kmsg copy helper
io_recvmsg() and io_sendmsg() duplicate nonblock -EAGAIN finilising
part, so add helper for that.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-02 14:04:41 -07:00
Pavel Begunkov
b0a20349f2 io_uring: clean io_poll_complete
Deduplicate call to io_cqring_fill_event(), plain and easy

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-02 14:04:37 -07:00
Pavel Begunkov
7d67af2c01 io_uring: add splice(2) support
Add support for splice(2).

- output file is specified as sqe->fd, so it's handled by generic code
- hash_reg_file handled by generic code as well
- len is 32bit, but should be fine
- the fd_in is registered file, when SPLICE_F_FD_IN_FIXED is set, which
is a splice flag (i.e. sqe->splice_flags).

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-02 14:04:37 -07:00
Pavel Begunkov
8da11c1994 io_uring: add interface for getting files
Preparation without functional changes. Adds io_get_file(), that allows
to grab files not only into req->file.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-02 14:04:37 -07:00
Pavel Begunkov
444ebb5768 splice: make do_splice public
Make do_splice(), so other kernel parts can reuse it

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-02 14:04:31 -07:00
Pavel Begunkov
bcaec089c5 io_uring: remove req->in_async
req->in_async is not really needed, it only prevents propagation of
@nxt for fast not-blocked submissions. Remove it.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-02 14:04:29 -07:00
Pavel Begunkov
deb6dc0544 io_uring: don't do full *prep_worker() from io-wq
io_prep_async_worker() called io_wq_assign_next() do many useless checks:
io_req_work_grab_env() was already called during prep, and @do_hashed
is not ever used. Add io_prep_next_work() -- simplified version, that
can be called io-wq.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-02 14:04:28 -07:00
Pavel Begunkov
5ea6216116 io_uring: don't call work.func from sync ctx
Many operations define custom work.func before getting into an io-wq.
There are several points against:
- it calls io_wq_assign_next() from outside io-wq, that may be confusing
- sync context would go unnecessary through io_req_cancelled()
- prototypes are quite different, so work!=old_work looks strange
- makes async/sync responsibilities fuzzy
- adds extra overhead

Don't call generic path and io-wq handlers from each other, but use
helpers instead

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-02 14:04:24 -07:00
Jens Axboe
e441d1cf20 io_uring: io_accept() should hold on to submit reference on retry
Don't drop an early reference, hang on to it and let the caller drop
it. This makes it behave more like "regular" requests.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-02 14:04:24 -07:00
Jens Axboe
29de5f6a35 io_uring: consider any io_read/write -EAGAIN as final
If the -EAGAIN happens because of a static condition, then a poll
or later retry won't fix it. We must call it again from blocking
condition. Play it safe and ensure that any -EAGAIN condition from read
or write must retry from async context.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-02 14:04:24 -07:00
Pavel Begunkov
80ad894382 io-wq: remove io_wq_flush and IO_WQ_WORK_INTERNAL
io_wq_flush() is buggy, during cancelation of a flush, the associated
work may be passed to the caller's (i.e. io_uring) @match callback. That
callback is expecting it to be embedded in struct io_kiocb. Cancelation
of internal work probably doesn't make a lot of sense to begin with.

As the flush helper is no longer used, just delete it and the associated
work flag.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-02 14:03:24 -07:00
Niklas Söderlund
3eb30c51a6 Documentation: nfsroot.rst: Fix references to nfsroot.rst
When converting and moving nfsroot.txt to nfsroot.rst the references to
the old text file was not updated to match the change, fix this.

Fixes: f9a9349846 ("Documentation: nfsroot.txt: convert to ReST")
Signed-off-by: Niklas Söderlund <niklas.soderlund+renesas@ragnatech.se>
Reviewed-by: Geert Uytterhoeven <geert+renesas@glider.be>
Link: https://lore.kernel.org/r/20200212181332.520545-1-niklas.soderlund+renesas@ragnatech.se
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
2020-03-02 13:11:46 -07:00
Pavel Begunkov
fc04c39bae io-wq: fix IO_WQ_WORK_NO_CANCEL cancellation
To cancel a work, io-wq sets IO_WQ_WORK_CANCEL and executes the
callback. However, IO_WQ_WORK_NO_CANCEL works will just execute and may
return next work, which will be ignored and lost.

Cancel the whole link.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-02 07:20:08 -07:00
Linus Torvalds
e70869821a Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 fixes from Ted Ts'o:
 "Two more bug fixes (including a regression) for 5.6"

* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: potential crash on allocation error in ext4_alloc_flex_bg_array()
  jbd2: fix data races at struct journal_head
2020-03-01 16:35:08 -06:00
Dan Carpenter
37b0b6b8b9 ext4: potential crash on allocation error in ext4_alloc_flex_bg_array()
If sbi->s_flex_groups_allocated is zero and the first allocation fails
then this code will crash.  The problem is that "i--" will set "i" to
-1 but when we compare "i >= sbi->s_flex_groups_allocated" then the -1
is type promoted to unsigned and becomes UINT_MAX.  Since UINT_MAX
is more than zero, the condition is true so we call kvfree(new_groups[-1]).
The loop will carry on freeing invalid memory until it crashes.

Fixes: 7c990728b9 ("ext4: fix potential race between s_flex_groups online resizing and access")
Reviewed-by: Suraj Jitindar Singh <surajjs@amazon.com>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Cc: stable@kernel.org
Link: https://lore.kernel.org/r/20200228092142.7irbc44yaz3by7nb@kili.mountain
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-02-29 17:48:08 -05:00
Qian Cai
6c5d911249 jbd2: fix data races at struct journal_head
journal_head::b_transaction and journal_head::b_next_transaction could
be accessed concurrently as noticed by KCSAN,

 LTP: starting fsync04
 /dev/zero: Can't open blockdev
 EXT4-fs (loop0): mounting ext3 file system using the ext4 subsystem
 EXT4-fs (loop0): mounted filesystem with ordered data mode. Opts: (null)
 ==================================================================
 BUG: KCSAN: data-race in __jbd2_journal_refile_buffer [jbd2] / jbd2_write_access_granted [jbd2]

 write to 0xffff99f9b1bd0e30 of 8 bytes by task 25721 on cpu 70:
  __jbd2_journal_refile_buffer+0xdd/0x210 [jbd2]
  __jbd2_journal_refile_buffer at fs/jbd2/transaction.c:2569
  jbd2_journal_commit_transaction+0x2d15/0x3f20 [jbd2]
  (inlined by) jbd2_journal_commit_transaction at fs/jbd2/commit.c:1034
  kjournald2+0x13b/0x450 [jbd2]
  kthread+0x1cd/0x1f0
  ret_from_fork+0x27/0x50

 read to 0xffff99f9b1bd0e30 of 8 bytes by task 25724 on cpu 68:
  jbd2_write_access_granted+0x1b2/0x250 [jbd2]
  jbd2_write_access_granted at fs/jbd2/transaction.c:1155
  jbd2_journal_get_write_access+0x2c/0x60 [jbd2]
  __ext4_journal_get_write_access+0x50/0x90 [ext4]
  ext4_mb_mark_diskspace_used+0x158/0x620 [ext4]
  ext4_mb_new_blocks+0x54f/0xca0 [ext4]
  ext4_ind_map_blocks+0xc79/0x1b40 [ext4]
  ext4_map_blocks+0x3b4/0x950 [ext4]
  _ext4_get_block+0xfc/0x270 [ext4]
  ext4_get_block+0x3b/0x50 [ext4]
  __block_write_begin_int+0x22e/0xae0
  __block_write_begin+0x39/0x50
  ext4_write_begin+0x388/0xb50 [ext4]
  generic_perform_write+0x15d/0x290
  ext4_buffered_write_iter+0x11f/0x210 [ext4]
  ext4_file_write_iter+0xce/0x9e0 [ext4]
  new_sync_write+0x29c/0x3b0
  __vfs_write+0x92/0xa0
  vfs_write+0x103/0x260
  ksys_write+0x9d/0x130
  __x64_sys_write+0x4c/0x60
  do_syscall_64+0x91/0xb05
  entry_SYSCALL_64_after_hwframe+0x49/0xbe

 5 locks held by fsync04/25724:
  #0: ffff99f9911093f8 (sb_writers#13){.+.+}, at: vfs_write+0x21c/0x260
  #1: ffff99f9db4c0348 (&sb->s_type->i_mutex_key#15){+.+.}, at: ext4_buffered_write_iter+0x65/0x210 [ext4]
  #2: ffff99f5e7dfcf58 (jbd2_handle){++++}, at: start_this_handle+0x1c1/0x9d0 [jbd2]
  #3: ffff99f9db4c0168 (&ei->i_data_sem){++++}, at: ext4_map_blocks+0x176/0x950 [ext4]
  #4: ffffffff99086b40 (rcu_read_lock){....}, at: jbd2_write_access_granted+0x4e/0x250 [jbd2]
 irq event stamp: 1407125
 hardirqs last  enabled at (1407125): [<ffffffff980da9b7>] __find_get_block+0x107/0x790
 hardirqs last disabled at (1407124): [<ffffffff980da8f9>] __find_get_block+0x49/0x790
 softirqs last  enabled at (1405528): [<ffffffff98a0034c>] __do_softirq+0x34c/0x57c
 softirqs last disabled at (1405521): [<ffffffff97cc67a2>] irq_exit+0xa2/0xc0

 Reported by Kernel Concurrency Sanitizer on:
 CPU: 68 PID: 25724 Comm: fsync04 Tainted: G L 5.6.0-rc2-next-20200221+ #7
 Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019

The plain reads are outside of jh->b_state_lock critical section which result
in data races. Fix them by adding pairs of READ|WRITE_ONCE().

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Qian Cai <cai@lca.pw>
Link: https://lore.kernel.org/r/20200222043111.2227-1-cai@lca.pw
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-02-29 13:40:02 -05:00
Linus Torvalds
74dea5d99d Merge tag 'io_uring-5.6-2020-02-28' of git://git.kernel.dk/linux-block
Pull io_uring fixes from Jens Axboe:

 - Fix for a race with IOPOLL used with SQPOLL (Xiaoguang)

 - Only show ->fdinfo if procfs is enabled (Tobias)

 - Fix for a chain with multiple personalities in the SQEs

 - Fix for a missing free of personality idr on exit

 - Removal of the spin-for-work optimization

 - Fix for next work lookup on request completion

 - Fix for non-vec read/write result progation in case of links

 - Fix for a fileset references on switch

 - Fix for a recvmsg/sendmsg 32-bit compatability mode

* tag 'io_uring-5.6-2020-02-28' of git://git.kernel.dk/linux-block:
  io_uring: fix 32-bit compatability with sendmsg/recvmsg
  io_uring: define and set show_fdinfo only if procfs is enabled
  io_uring: drop file set ref put/get on switch
  io_uring: import_single_range() returns 0/-ERROR
  io_uring: pick up link work on submit reference drop
  io-wq: ensure work->task_pid is cleared on init
  io-wq: remove spin-for-work optimization
  io_uring: fix poll_list race for SETUP_IOPOLL|SETUP_SQPOLL
  io_uring: fix personality idr leak
  io_uring: handle multiple personalities in link chains
2020-02-28 11:39:14 -08:00
Eric W. Biederman
69879c01a0 proc: Remove the now unnecessary internal mount of proc
There remains no more code in the kernel using pids_ns->proc_mnt,
therefore remove it from the kernel.

The big benefit of this change is that one of the most error prone and
tricky parts of the pid namespace implementation, maintaining kernel
mounts of proc is removed.

In addition removing the unnecessary complexity of the kernel mount
fixes a regression that caused the proc mount options to be ignored.
Now that the initial mount of proc comes from userspace, those mount
options are again honored.  This fixes Android's usage of the proc
hidepid option.

Reported-by: Alistair Strachan <astrachan@google.com>
Fixes: e94591d0d9 ("proc: Convert proc_mount to use mount_ns.")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2020-02-28 12:06:14 -06:00
Linus Torvalds
bfeb4f9977 Merge tag 'zonefs-5.6-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/zonefs
Pull zonefs fixes from Damien Le Moal:
 "Two fixes in here:

   - Revert the initial decision to silently ignore IOCB_NOWAIT for
     asynchronous direct IOs to sequential zone files. Instead, return
     an error to the user to signal that the feature is not supported
     (from Christoph)

   - A fix to zonefs Kconfig to select FS_IOMAP to avoid build failures
     if no other file system already selected this option (from
     Johannes)"

* tag 'zonefs-5.6-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/zonefs:
  zonefs: select FS_IOMAP
  zonefs: fix IOCB_NOWAIT handling
2020-02-28 08:34:47 -08:00
David S. Miller
9f6e055907 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
The mptcp conflict was overlapping additions.

The SMC conflict was an additional and removal happening at the same
time.

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-27 18:31:39 -08:00
Jens Axboe
d876836204 io_uring: fix 32-bit compatability with sendmsg/recvmsg
We must set MSG_CMSG_COMPAT if we're in compatability mode, otherwise
the iovec import for these commands will not do the right thing and fail
the command with -EINVAL.

Found by running the test suite compiled as 32-bit.

Cc: stable@vger.kernel.org
Fixes: aa1fa28fc7 ("io_uring: add support for recvmsg()")
Fixes: 0fa03c624d ("io_uring: add support for sendmsg()")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-02-27 14:17:49 -07:00
Al Viro
239eb98338 atomic_open(): saner calling conventions (return dentry on success)
Currently it either returns -E... or puts (nd->path.mnt,dentry)
into *path and returns 0.  Make it return ERR_PTR(-E...) or
dentry; adjust the caller.  Fewer arguments and it's easier
to keep track of *path contents that way.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-02-27 14:43:56 -05:00
Al Viro
bd7c4b5083 handle_mounts(): start building a sane wrapper for follow_managed()
All callers of follow_managed() follow it on success with the same steps -
d_backing_inode(path->dentry) is calculated and stored into some struct inode *
variable and, in all but one case, an unsigned variable (nd->seq to be) is
zeroed.  The single exception is lookup_fast() and there zeroing is correct
thing to do - not doing it is a pointless microoptimization.

	Add a wrapper for follow_managed() that would do that combination.
It's mostly a vehicle for code massage - it will be changing quite a bit,
and the current calling conventions are by no means final.  Right now it
takes path, nameidata and (as out params) inode and seq, similar to
__follow_mount_rcu().  Which will soon get folded into it...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-02-27 14:43:56 -05:00
Al Viro
31d1726d72 make build_open_flags() treat O_CREAT | O_EXCL as implying O_NOFOLLOW
O_CREAT | O_EXCL means "-EEXIST if we run into a trailing symlink".
As it is, we might or might not have LOOKUP_FOLLOW in op->intent
in that case - that depends upon having O_NOFOLLOW in open flags.
It doesn't matter, since we won't be checking it in that case -
do_last() bails out earlier.

However, making sure it's not set (i.e. acting as if we had an explicit
O_NOFOLLOW) makes the behaviour more explicit and allows to reorder the
check for O_CREAT | O_EXCL in do_last() with the call of step_into()
immediately following it.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-02-27 14:43:56 -05:00
Al Viro
1c9f5e06a6 follow_automount() doesn't need the entire nameidata
Only the address of ->total_link_count and the flags.
And fix an off-by-one is ELOOP detection - make it
consistent with symlink following, where we check if
the pre-increment value has reached 40, rather than
check the post-increment one.

[kudos to Christian Brauner for spotted braino]

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-02-27 14:43:55 -05:00
Al Viro
25e195aa1e follow_automount(): get rid of dead^Wstillborn code
1) no instances of ->d_automount() have ever made use of the "return
ERR_PTR(-EISDIR) if you don't feel like mounting anything" - that's
a rudiment of plans that got superseded before the thing went into
the tree.  Despite the comment in follow_automount(), autofs has
never done that.

2) if there's no ->d_automount() in dentry_operations, filesystems
should not set DCACHE_NEED_AUTOMOUNT in the first place.  None have
ever done so...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-02-27 14:43:55 -05:00
Al Viro
26df6034fd fix automount/automount race properly
Protection against automount/automount races (two threads hitting the same
referral point at the same time) is based upon do_add_mount() prevention of
identical overmounts - trying to overmount the root of mounted tree with
the same tree fails with -EBUSY.  It's unreliable (the other thread might've
mounted something on top of the automount it has triggered) *and* causes
no end of headache for follow_automount() and its caller, since
finish_automount() behaves like do_new_mount() - if the mountpoint to be is
overmounted, it mounts on top what's overmounting it.  It's not only wrong
(we want to go into what's overmounting the automount point and quietly
discard what we planned to mount there), it introduces the possibility of
original parent mount getting dropped.  That's what 8aef188452 (VFS: Fix
vfsmount overput on simultaneous automount) deals with, but it can't do
anything about the reliability of conflict detection - if something had
been overmounted the other thread's automount (e.g. that other thread
having stepped into automount in mount(2)), we don't get that -EBUSY and
the result is
	 referral point under automounted NFS under explicit overmount
under another copy of automounted NFS

What we need is finish_automount() *NOT* digging into overmounts - if it
finds one, it should just quietly discard the thing it was asked to mount.
And don't bother with actually crossing into the results of finish_automount() -
the same loop that calls follow_automount() will do that just fine on the
next iteration.

IOW, instead of calling lock_mount() have finish_automount() do it manually,
_without_ the "move into overmount and retry" part.  And leave crossing into
the results to the caller of follow_automount(), which simplifies it a lot.

Moral: if you end up with a lot of glue working around the calling conventions
of something, perhaps these calling conventions are simply wrong...

Fixes: 8aef188452 (VFS: Fix vfsmount overput on simultaneous automount)
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-02-27 14:40:43 -05:00
Chao Yu
df77fbd8c5 f2fs: fix to avoid potential deadlock
Using f2fs_trylock_op() in f2fs_write_compressed_pages() to avoid potential
deadlock like we did in f2fs_write_single_data_page().

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-02-27 10:16:45 -08:00
Chao Yu
097a768650 f2fs: add missing function name in kernel message
Otherwise, we can not distinguish the exact location of messages,
when there are more than one places printing same message.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-02-27 10:16:45 -08:00
Chao Yu
0b32dc1864 f2fs: recycle unused compress_data.chksum feild
In Struct compress_data, chksum field was never used, remove it.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-02-27 10:16:45 -08:00
Chao Yu
61fbae2b2b f2fs: fix to avoid NULL pointer dereference
Unable to handle kernel NULL pointer dereference at virtual address 00000000
PC is at f2fs_free_dic+0x60/0x2c8
LR is at f2fs_decompress_pages+0x3c4/0x3e8
f2fs_free_dic+0x60/0x2c8
f2fs_decompress_pages+0x3c4/0x3e8
__read_end_io+0x78/0x19c
f2fs_post_read_work+0x6c/0x94
process_one_work+0x210/0x48c
worker_thread+0x2e8/0x44c
kthread+0x110/0x120
ret_from_fork+0x10/0x18

In f2fs_free_dic(), we can not use f2fs_put_page(,1) to release dic->tpages[i],
as the page's mapping is NULL.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-02-27 10:16:45 -08:00
Eric Biggers
7fa6d59816 f2fs: fix leaking uninitialized memory in compressed clusters
When the compressed data of a cluster doesn't end on a page boundary,
the remainder of the last page must be zeroed in order to avoid leaking
uninitialized memory to disk.

Fixes: 4c8ff7095b ("f2fs: support data compression")
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-02-27 10:16:44 -08:00
Sahitya Tummala
bf22c3cc8c f2fs: fix the panic in do_checkpoint()
There could be a scenario where f2fs_sync_meta_pages() will not
ensure that all F2FS_DIRTY_META pages are submitted for IO. Thus,
resulting in the below panic in do_checkpoint() -

f2fs_bug_on(sbi, get_pages(sbi, F2FS_DIRTY_META) &&
				!f2fs_cp_error(sbi));

This can happen in a low-memory condition, where shrinker could
also be doing the writepage operation (stack shown below)
at the same time when checkpoint is running on another core.

schedule
down_write
f2fs_submit_page_write -> by this time, this page in page cache is tagged
			as PAGECACHE_TAG_WRITEBACK and PAGECACHE_TAG_DIRTY
			is cleared, due to which f2fs_sync_meta_pages()
			cannot sync this page in do_checkpoint() path.
f2fs_do_write_meta_page
__f2fs_write_meta_page
f2fs_write_meta_page
shrink_page_list
shrink_inactive_list
shrink_node_memcg
shrink_node
kswapd

Signed-off-by: Sahitya Tummala <stummala@codeaurora.org>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-02-27 10:16:44 -08:00
Chao Yu
dc5a941223 f2fs: fix to wait all node page writeback
There is a race condition that we may miss to wait for all node pages
writeback, fix it.

- fsync()				- shrink
 - f2fs_do_sync_file
					 - __write_node_page
					  - set_page_writeback(page#0)
					  : remove DIRTY/TOWRITE flag
  - f2fs_fsync_node_pages
  : won't find page #0 as TOWRITE flag was removeD
  - f2fs_wait_on_node_pages_writeback
  : wont' wait page #0 writeback as it was not in fsync_node_list list.
					   - f2fs_add_fsync_node_entry

Fixes: 50fa53eccf ("f2fs: fix to avoid broken of dnode block list")
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-02-27 10:16:44 -08:00
Vasily Averin
6c871b7314 pstore: pstore_ftrace_seq_next should increase position index
In Aug 2018 NeilBrown noticed
commit 1f4aace60b ("fs/seq_file.c: simplify seq_file iteration code and interface")
"Some ->next functions do not increment *pos when they return NULL...
Note that such ->next functions are buggy and should be fixed.
A simple demonstration is

 dd if=/proc/swaps bs=1000 skip=1

Choose any block size larger than the size of /proc/swaps. This will
always show the whole last line of /proc/swaps"

/proc/swaps output was fixed recently, however there are lot of other
affected files, and one of them is related to pstore subsystem.

If .next function does not change position index, following .show function
will repeat output related to current position index.

There are at least 2 related problems:
- read after lseek beyond end of file, described above by NeilBrown
  "dd if=<AFFECTED_FILE> bs=1000 skip=1" will generate whole last list
- read after lseek on in middle of last line will output expected rest of
  last line but then repeat whole last line once again.

If .show() function generates multy-line output (like
pstore_ftrace_seq_show() does ?) following bash script cycles endlessly

 $ q=;while read -r r;do echo "$((++q)) $r";done < AFFECTED_FILE

Unfortunately I'm not familiar enough to pstore subsystem and was unable
to find affected pstore-related file on my test node.

If .next function does not change position index, following .show function
will repeat output related to current position index.

Cc: stable@vger.kernel.org
Fixes: 1f4aace60b ("fs/seq_file.c: simplify seq_file iteration code ...")
Link: https://bugzilla.kernel.org/show_bug.cgi?id=206283
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Link: https://lore.kernel.org/r/4e49830d-4c88-0171-ee24-1ee540028dad@virtuozzo.com
[kees: with robustness tweak from Joel Fernandes <joelaf@google.com>]
Signed-off-by: Kees Cook <keescook@chromium.org>
2020-02-27 08:04:59 -08:00
Tobias Klauser
bebdb65e07 io_uring: define and set show_fdinfo only if procfs is enabled
Follow the pattern used with other *_show_fdinfo functions and only
define and use io_uring_show_fdinfo and its helper functions if
CONFIG_PROC_FS is set.

Fixes: 87ce955b24 ("io_uring: add ->show_fdinfo() for the io_uring file descriptor")
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-02-27 06:56:21 -07:00
Bob Peterson
cc44457f16 gfs2: leaf_dealloc needs to allocate one more revoke
Function leaf_dealloc was not allocating enough journal space for
revokes. Before, it allocated 'l_blocks' revokes, but it needs one
more for the revoke of the dinode that is modified. This patch adds
the needed revoke entry to function leaf_dealloc.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2020-02-27 07:53:18 -06:00
Bob Peterson
c9ebc4b737 gfs2: allow journal replay to hold sd_log_flush_lock
Before this patch, journal replays could stomp on log flushes
and each other because both log flushes and journal replays used
the same sd_log_bio. Function gfs2_log_flush prevents other log
flushes from interfering by taking the sd_log_flush_lock rwsem
during the flush. However, it does not protect against journal
replays. This patch allows the journal replay to take the same
sd_log_flush_lock rwsem so use of the sd_log_bio is not stomped.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2020-02-27 07:53:18 -06:00
Bob Peterson
019dd669bd gfs2: don't allow releasepage to free bd still used for revokes
Before this patch, function gfs2_releasepage would free any bd
elements that had been used for the page being released. However,
those bd elements may still be queued to the sd_log_revokes list,
in which case we cannot free them until the revoke has been issued.

This patch adds additional checks for bds that are still being
used for revokes.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2020-02-27 07:53:18 -06:00
Bob Peterson
ca399c96e9 gfs2: flesh out delayed withdraw for gfs2_log_flush
Function gfs2_log_flush() had a few places where it tried to withdraw
from the file system when errors were encountered. The problem is,
it should delay those withdraws until the log flush lock is no longer
held.

This patch creates a new function just for delayed withdraws for
situations like this. If errors=panic was specified on mount, we
still want to do it the old fashioned way because the panic it does
not help to delay in that situation.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com>
2020-02-27 07:53:18 -06:00
Bob Peterson
1c634f94c3 gfs2: Do proper error checking for go_sync family of glops functions
Before this patch, function do_xmote would try to sync out the glock
dirty data by calling the appropriate glops function XXX_go_sync()
but it did not check for a good return code. If the sync was not
possible due to an io error or whatever, do_xmote would continue on
and call go_inval and release the glock to other cluster nodes.
When those nodes go to replay the journal, they may already be holding
glocks for the journal records that should have been synced, but were
not due to the ignored error.

This patch introduces proper error code checking to the go_sync
family of glops functions.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com>
2020-02-27 07:53:18 -06:00
Bob Peterson
df5db5f9ee gfs2: Don't demote a glock until its revokes are written
Before this patch, run_queue would demote glocks based on whether
there are any more holders. But if the glock has pending revokes that
haven't been written to the media, giving up the glock might end in
file system corruption if the revokes never get written due to
io errors, node crashes and fences, etc. In that case, another node
will replay the metadata blocks associated with the glock, but
because the revoke was never written, it could replay that block
even though the glock had since been granted to another node who
might have made changes.

This patch changes the logic in run_queue so that it never demotes
a glock until its count of pending revokes reaches zero.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com>
2020-02-27 07:53:18 -06:00
Bob Peterson
2ca0c2fbf3 gfs2: drain the ail2 list after io errors
Before this patch, gfs2_logd continually tried to flush its journal
log, after the file system is withdrawn. We don't want to write anything
to the journal, lest we add corruption. Best course of action is to
drain the ail1 into the ail2 list (via gfs2_ail1_empty) then drain the
ail2 list with a new function, ail2_drain.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com>
2020-02-27 07:53:18 -06:00
Bob Peterson
b1676cbb11 gfs2: Withdraw in gfs2_ail1_flush if write_cache_pages fails
Before this patch, function gfs2_ail1_start_one would return any
errors it received from write_cache_pages (except -EBUSY) but it did
not withdraw. Since function gfs2_ail1_flush just checks for the bad
return code and loops, the loop might potentially never end.
This patch adds some logic to allow it to exit the loop and withdraw
properly when errors are received from write_cache_pages.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com>
2020-02-27 07:53:18 -06:00
Bob Peterson
9ff7828935 gfs2: Do log_flush in gfs2_ail_empty_gl even if ail list is empty
Before this patch, if gfs2_ail_empty_gl saw there was nothing on
the ail list, it would return and not flush the log. The problem
is that there could still be a revoke for the rgrp sitting on the
sd_log_le_revoke list that's been recently taken off the ail list.
But that revoke still needs to be written, and the rgrp_go_inval
still needs to call log_flush_wait to ensure the revokes are all
properly written to the journal before we relinquish control of
the glock to another node. If we give the glock to another node
before we have this knowledge, the node might crash and its journal
replayed, in which case the missing revoke would allow the journal
replay to replay the rgrp over top of the rgrp we already gave to
another node, thus overwriting its changes and corrupting the
file system.

This patch makes gfs2_ail_empty_gl still call gfs2_log_flush rather
than returning.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com>
2020-02-27 07:53:18 -06:00
Bob Peterson
d93ae386ef gfs2: Check for log write errors before telling dlm to unlock
Before this patch, function do_xmote just assumed all the writes
submitted to the journal were finished and successful, and it
called the go_unlock function to release the dlm lock. But if
they're not, and a revoke failed to make its way to the journal,
a journal replay on another node will cause corruption if we
let the go_inval function continue and tell dlm to release the
glock to another node. This patch adds a couple checks for errors
in do_xmote after the calls to go_sync and go_inval. If an error
is found, we cannot withdraw yet, because the withdraw itself
uses glocks to make the file system read-only. Instead, we flag
the error. Later, asserts should cause another node to replay
the journal before continuing, thus protecting rgrp and dinode
glocks and maintaining the integrity of the metadata. Note that
we only need to do this for journaled glocks. System glocks
should be able to progress even under withdrawn conditions.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com>
2020-02-27 07:53:18 -06:00
Bob Peterson
f05b86db31 gfs2: Prepare to withdraw as soon as an IO error occurs in log write
Before this patch, function gfs2_end_log_write would detect any IO
errors writing to the journal and put out an appropriate message,
but it never set a withdrawing condition. Eventually, the log daemon
would see the error and determine it was time to withdraw, but in
the meantime, other processes could continue running as if nothing
bad ever happened. The biggest consequence is that __gfs2_glock_put
would BUG() when it saw that there were still unwritten items.

This patch sets the WITHDRAWING status as soon as an IO error is
detected, and that way, the BUG will be avoided so the file system
can be properly withdrawn and unmounted.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com>
2020-02-27 07:53:18 -06:00
Bob Peterson
5e4c7632aa gfs2: Issue revokes more intelligently
Before this patch, function gfs2_write_revokes would call
gfs2_ail1_empty, then traverse the sd_ail1_list looking for
transactions that had bds which were no longer queued to a glock.
And if it found some, it would try to issue revokes for them, up to
a predetermined maximum. There were two problems with how it did
this. First was the fact that gfs2_ail1_empty moves transactions
which have nothing remaining on the ail1 list from the sd_ail1_list
to the sd_ail2_list, thus making its traversal of sd_ail1_list
miss them completely, and therefore, never issue revokes for them.
Second was the fact that there were three traversals (or partial
traversals) of the sd_ail1_list, each of which took and then
released the sd_ail_lock lock: First inside gfs2_ail1_empty,
second to determine if there are any revokes to be issued, and
third to actually issue them. All this taking and releasing of the
sd_ail_lock meant other processes could modify the lists and the
conditions in which we're working.

This patch simplies the whole process by adding a new parameter
to function gfs2_ail1_empty, max_revokes. For normal calls, this
is passed in as 0, meaning we don't want to issue any revokes.
For function gfs2_write_revokes, we pass in the maximum number
of revokes we can, thus allowing gfs2_ail1_empty to add the
revokes where needed. This simplies the code, allows for a single
holding of the sd_ail_lock, and allows gfs2_ail1_empty to add
revokes for all the necessary bd items without missing any.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com>
2020-02-27 07:53:18 -06:00
Bob Peterson
7d9f924958 gfs2: Add verbose option to check_journal_clean
Before this patch, function check_journal_clean would give messages
related to journal recovery. That's fine for mount time, but when a
node withdraws and forces replay that way, we don't want all those
distracting and misleading messages. This patch adds a new parameter
to make those messages optional.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com>
2020-02-27 07:53:17 -06:00