[ Upstream commit 10f966bbf5 ]
f2fs uses EFAULT as error number to indicate filesystem is corrupted
all the time, but generic filesystems use EUCLEAN for such condition,
we need to change to follow others.
This patch adds two new macros as below to wrap more generic error
code macros, and spread them in code.
EFSBADCRC EBADMSG /* Bad CRC detected */
EFSCORRUPTED EUCLEAN /* Filesystem is corrupted */
Reported-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit b42b179bda ]
As Jungyeon reported in bugzilla:
https://bugzilla.kernel.org/show_bug.cgi?id=203221
- Overview
When mounting the attached crafted image and running program, this error is reported.
The image is intentionally fuzzed from a normal f2fs image for testing and I enabled option CONFIG_F2FS_CHECK_FS on.
- Reproduces
cc poc_07.c
mkdir test
mount -t f2fs tmp.img test
cp a.out test
cd test
sudo ./a.out
- Messages
kernel BUG at fs/f2fs/node.c:1279!
RIP: 0010:read_node_page+0xcf/0xf0
Call Trace:
__get_node_page+0x6b/0x2f0
f2fs_iget+0x8f/0xdf0
f2fs_lookup+0x136/0x320
__lookup_slow+0x92/0x140
lookup_slow+0x30/0x50
walk_component+0x1c1/0x350
path_lookupat+0x62/0x200
filename_lookup+0xb3/0x1a0
do_fchmodat+0x3e/0xa0
__x64_sys_chmod+0x12/0x20
do_syscall_64+0x43/0xf0
entry_SYSCALL_64_after_hwframe+0x44/0xa9
On below paths, we can have opportunity to readahead inode page
- gc_node_segment -> f2fs_ra_node_page
- gc_data_segment -> f2fs_ra_node_page
- f2fs_fill_dentries -> f2fs_ra_node_page
Unlike synchronized read, on readahead path, we can set page uptodate
before verifying page's checksum, then read_node_page() will trigger
kernel panic once it encounters a uptodated page w/ incorrect checksum.
So considering readahead scenario, we have to do checksum each time
when loading inode page even if it is uptodated.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 626bcf2b7c ]
As Jungyeon reported in bugzilla:
https://bugzilla.kernel.org/show_bug.cgi?id=203225
- Overview
When mounting the attached crafted image and unmounting it, following errors are reported.
Additionally, it hangs on sync after unmounting.
The image is intentionally fuzzed from a normal f2fs image for testing.
Compile options for F2FS are as follows.
CONFIG_F2FS_FS=y
CONFIG_F2FS_STAT_FS=y
CONFIG_F2FS_FS_XATTR=y
CONFIG_F2FS_FS_POSIX_ACL=y
CONFIG_F2FS_CHECK_FS=y
- Reproduces
mkdir test
mount -t f2fs tmp.img test
touch test/t
umount test
sync
- Messages
kernel BUG at fs/f2fs/node.c:3073!
RIP: 0010:f2fs_destroy_node_manager+0x2f0/0x300
Call Trace:
f2fs_put_super+0xf4/0x270
generic_shutdown_super+0x62/0x110
kill_block_super+0x1c/0x50
kill_f2fs_super+0xad/0xd0
deactivate_locked_super+0x35/0x60
cleanup_mnt+0x36/0x70
task_work_run+0x75/0x90
exit_to_usermode_loop+0x93/0xa0
do_syscall_64+0xba/0xf0
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0010:f2fs_destroy_node_manager+0x2f0/0x300
NAT table is corrupted, so reserved meta/node inode ids were added into
free list incorrectly, during file creation, since reserved id has cached
in inode hash, so it fails the creation and preallocated nid can not be
released later, result in kernel panic.
To fix this issue, let's do nid boundary check during free nid loading.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 8b6810f8ac ]
As Jungyeon reported in bugzilla:
https://bugzilla.kernel.org/show_bug.cgi?id=203219
- Overview
When mounting the attached crafted image and running program, I got this error.
Additionally, it hangs on sync after running the program.
The image is intentionally fuzzed from a normal f2fs image for testing and I enabled option CONFIG_F2FS_CHECK_FS on.
- Reproduces
cc poc_06.c
mkdir test
mount -t f2fs tmp.img test
cp a.out test
cd test
sudo ./a.out
sync
- Messages
kernel BUG at fs/f2fs/node.c:1183!
RIP: 0010:f2fs_remove_inode_page+0x294/0x2d0
Call Trace:
f2fs_evict_inode+0x2a3/0x3a0
evict+0xba/0x180
__dentry_kill+0xbe/0x160
dentry_kill+0x46/0x180
dput+0xbb/0x100
do_renameat2+0x3c9/0x550
__x64_sys_rename+0x17/0x20
do_syscall_64+0x43/0xf0
entry_SYSCALL_64_after_hwframe+0x44/0xa9
The reason is f2fs_remove_inode_page() will trigger kernel panic due to
inconsistent i_blocks value of inode.
To avoid panic, let's just print debug message and set SBI_NEED_FSCK to
give a hint to fsck for latter repairing of potential image corruption.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
[Jaegeuk Kim: fix build warning and add unlikely]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit ef2a007134 upstream.
Testcase to reproduce this bug:
1. mkfs.f2fs /dev/sdd
2. mount -t f2fs /dev/sdd /mnt/f2fs
3. touch /mnt/f2fs/file
4. sync
5. chattr +A /mnt/f2fs/file
6. xfs_io -f /mnt/f2fs/file -c "fsync"
7. godown /mnt/f2fs
8. umount /mnt/f2fs
9. mount -t f2fs /dev/sdd /mnt/f2fs
10. chattr -A /mnt/f2fs/file
11. xfs_io -f /mnt/f2fs/file -c "fsync"
12. umount /mnt/f2fs
13. mount -t f2fs /dev/sdd /mnt/f2fs
14. lsattr /mnt/f2fs/file
-----------------N- /mnt/f2fs/file
But actually, we expect the corrct result is:
-------A---------N- /mnt/f2fs/file
The reason is in step 9) we missed to recover cold bit flag in inode
block, so later, in fsync, we will skip write inode block due to below
condition check, result in lossing data in another SPOR.
f2fs_fsync_node_pages()
if (!IS_DNODE(page) || !is_cold_node(page))
continue;
Note that, I guess that some non-dir inode has already lost cold bit
during POR, so in order to reenable recovery for those inode, let's
try to recover cold bit in f2fs_iget() to save more fsynced data.
Fixes: c56675750d ("f2fs: remove unneeded set_cold_node()")
Cc: <stable@vger.kernel.org> 4.17+
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When CONFIG_F2FS_FAULT_INJECTION is disabled, we get a warning about an
unused label:
fs/f2fs/segment.c: In function '__submit_discard_cmd':
fs/f2fs/segment.c:1059:1: error: label 'submit' defined but not used [-Werror=unused-label]
This could be fixed by adding another #ifdef around it, but the more
reliable way of doing this seems to be to remove the other #ifdefs
where that is easily possible.
By defining time_to_inject() as a trivial stub, most of the checks for
CONFIG_F2FS_FAULT_INJECTION can go away. This also leads to nicer
formatting of the code.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Introduce nat_list_lock to protect nm_i->nat_entries list, and manage
it as a LRU list, refresh location for therein recent accessed entries
in the list.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
f2fs recovery flow is relying on dnode block link list, it means fsynced
file recovery depends on previous dnode's persistence in the list, so
during fsync() we should wait on all regular inode's dnode writebacked
before issuing flush.
By this way, we can avoid dnode block list being broken by out-of-order
IO submission due to IO scheduler or driver.
Sheng Yong helps to do the test with this patch:
Target:/data (f2fs, -)
64MB / 32768KB / 4KB / 8
1 / PERSIST / Index
Base:
SEQ-RD(MB/s) SEQ-WR(MB/s) RND-RD(IOPS) RND-WR(IOPS) Insert(TPS) Update(TPS) Delete(TPS)
1 867.82 204.15 41440.03 41370.54 680.8 1025.94 1031.08
2 871.87 205.87 41370.3 40275.2 791.14 1065.84 1101.7
3 866.52 205.69 41795.67 40596.16 694.69 1037.16 1031.48
Avg 868.7366667 205.2366667 41535.33333 40747.3 722.21 1042.98 1054.753333
After:
SEQ-RD(MB/s) SEQ-WR(MB/s) RND-RD(IOPS) RND-WR(IOPS) Insert(TPS) Update(TPS) Delete(TPS)
1 798.81 202.5 41143 40613.87 602.71 838.08 913.83
2 805.79 206.47 40297.2 41291.46 604.44 840.75 924.27
3 814.83 206.17 41209.57 40453.62 602.85 834.66 927.91
Avg 806.4766667 205.0466667 40883.25667 40786.31667 603.3333333 837.83 922.0033333
Patched/Original:
0.928332713 0.999074239 0.984300676 1.000957528 0.835398753 0.803303994 0.874141189
It looks like atomic write will suffer performance regression.
I suspect that the criminal is that we forcing to wait all dnode being in
storage cache before we issue PREFLUSH+FUA.
BTW, will commit ("f2fs: don't need to wait for node writes for atomic write")
cause the problem: we will lose data of last transaction after SPO, even if
atomic write return no error:
- atomic_open();
- write() P1, P2, P3;
- atomic_commit();
- writeback data: P1, P2, P3;
- writeback node: N1, N2, N3; <--- If N1, N2 is not writebacked, N3 with fsync_mark is
writebacked, In SPOR, we won't find N3 since node chain is broken, turns out that losing
last transaction.
- preflush + fua;
- power-cut
If we don't wait dnode writeback for atomic_write:
SEQ-RD(MB/s) SEQ-WR(MB/s) RND-RD(IOPS) RND-WR(IOPS) Insert(TPS) Update(TPS) Delete(TPS)
1 779.91 206.03 41621.5 40333.16 716.9 1038.21 1034.85
2 848.51 204.35 40082.44 39486.17 791.83 1119.96 1083.77
3 772.12 206.27 41335.25 41599.65 723.29 1055.07 971.92
Avg 800.18 205.55 41013.06333 40472.99333 744.0066667 1071.08 1030.18
Patched/Original:
0.92108464 1.001526693 0.987425886 0.993268102 1.030180511 1.026942031 0.976702294
SQLite's performance recovers.
Jaegeuk:
"Practically, I don't see db corruption becase of this. We can excuse to lose
the last transaction."
Finally, we decide to keep original implementation of atomic write interface
sematics that we don't wait all dnode writeback before preflush+fua submission.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
There is a subtle race condition to invoke f2fs_bug_on() in shutdown tests. I've
confirmed that the last checkpoint is preserved in consistent state, so it'd be
fine to just return error at this moment.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Fsyncer will wait on all dnode pages of regular writeback before flushing,
if there are async dnode pages blocked by IO scheduler, it may decrease
fsync's performance.
In this patch, we choose to let f2fs_balance_fs_bg() to trigger checkpoint
to flush these dnode pages of regular, so async IO of dnode page can be
elimitnated, making fsyncer only need to wait for sync IO.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If caller of __get_meta_page() can handle error, let's propagate error
from __get_meta_page().
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In synchronous scenario, like in checkpoint(), we are going to flush
dirty node pages to device synchronously, we can easily failed
writebacking node page due to trylock_page() failure, especially in
condition of intensive lock competition, which can cause long latency
of checkpoint(). So let's use lock_page() in synchronous scenario to
avoid this issue.
Signed-off-by: Yunlei He <heyunlei@huawei.com>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch used f2fs_bitmap_size macro to calculate mem used by
free nid bitmap, and stat used mem including aligned part.
Signed-off-by: Yunlei He <heyunlei@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch introduces verify_blkaddr to check meta/data block address
with valid range to detect bug earlier.
In addition, once we encounter an invalid blkaddr, notice user to run
fsck to fix, and let the kernel panic.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Enable in-memory inode checksum to protect metadata blocks from
in-memory scribbles when checking consistency, which has no
performance requirements.
Signed-off-by: Weichao Guo <guoweichao@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Once we shutdown f2fs, we have to flush stale pages in order to unmount
the system. In order to make stable, we need to stop fault injection as well.
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Let's flush journal nat entries for speed up in the next run.
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Although mixed sync/async IOs can have continuous LBA, as they have
different IO priority, block IO scheduler will add them into different
queues and commit them separately, result in splited IOs which causes
wrose performance.
This patch gives high priority to synchronous IO of nodes, means that
once synchronous flow starts, it can interrupt asynchronous writeback
flow of system flusher, so more big IOs can be expected.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
We should never falsify wbc->sync_mode passed from mm, otherwise
mm can trigger writeback with wrong IO priority.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
As Ted reported:
"Hi, I was looking at f2fs's sources recently, and I noticed that there
is a very large number of non-static symbols which don't have a f2fs
prefix. There's well over a hundred (see attached below).
As one example, in fs/f2fs/dir.c there is:
unsigned char get_de_type(struct f2fs_dir_entry *de)
This function is clearly only useful for f2fs, but it has a generic
name. This means that if any other file system tries to have the same
symbol name, there will be a symbol conflict and the kernel would not
successfully build. It also means that when someone is looking f2fs
sources, it's not at all obvious whether a function such as
read_data_page(), invalidate_blocks(), is a generic kernel function
found in the fs, mm, or block layers, or a f2fs specific function.
You might want to fix this at some point. Hopefully Kent's bcachefs
isn't similarly using genericly named functions, since that might
cause conflicts with f2fs's functions --- but just as this would be a
problem that we would rightly insist that Kent fix, this is something
that we should have rightly insisted that f2fs should have fixed
before it was integrated into the mainline kernel.
acquire_orphan_inode
add_ino_entry
add_orphan_inode
allocate_data_block
allocate_new_segments
alloc_nid
alloc_nid_done
alloc_nid_failed
available_free_memory
...."
This patch adds "f2fs_" prefix for all non-static symbols in order to:
a) avoid conflict with other kernel generic symbols;
b) to indicate the function is f2fs specific one instead of generic
one;
Reported-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Introduce clear_radix_tree_dirty_tag to include common codes for cleanup.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
- rename is_valid_blkaddr() to is_valid_meta_blkaddr() for readability.
- introduce is_valid_blkaddr() for cleanup.
No logic change in this patch.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch clears PageError in some pages tagged by read path, but when we
write the pages with valid contents, writepage should clear the bit likewise
ext4.
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch reverts copied f2fs_set_page_dirty_nobuffer to use generic function
for stability.
This reverts commit fe76b796fc.
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Nat entry set is used only in checkpoint(), and during checkpoint() we
won't flush new nat entry with unallocated address, so we don't need to
add new nat entry into nat set, then nat_entry_set::entry_cnt can
indicate actual entry count we need to flush in checkpoint().
Signed-off-by: Yunlei He <heyunlei@huawei.com>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
For example, in arm64, free_nid_bitmap should be aligned to word size in order
to use bit operations.
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
When setting COLD_BIT_SHIFT flag in node block, we only need to call
set_cold_node() in new_node_page() and recover_inode_page() during
node page initialization. So remove unneeded set_cold_node() in other
places.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Don't remain dirtied page cache in f2fs after shutdown, it can mitigate
memory pressure of whole system, in order to keep other modules working
properly.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Cgroup writeback requires explicit support from the filesystem.
f2fs's data and node writeback IOs go through __write_data_page,
which sets fio for submiting IOs. So, we add io_wbc for fio,
associate bios with blkcg by invoking wbc_init_bio() and
account IOs issuing by wbc_account_io().
In addtion, f2fs_fill_super() is updated to set SB_I_CGROUPWB.
Meta writeback IOs is left alone by this patch and will always be
attributed to the root cgroup.
The results show that f2fs can throttle writeback nicely for
data writing and file creating.
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
After checkpoint,
1. creat a new file A ,(with dirty inode && dirty inode page && xattr info)
2. backgroud wb write back file A inode page (without update from inode cache)
3. fsync file A, write back inode page of file A with inode cache info
4. sudden power off before new checkpoint
In this case, recovery process will try to recover a zero inode
page. Inline xattr flag of file A will be miss and xattr info
will be taken as blkaddr index.
Signed-off-by: Yunlei He <heyunlei@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch supports to inject fault into kvmalloc/kvzalloc.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch introduces f2fs_kzalloc based on f2fs_kmalloc in order to
support error injection for kzalloc().
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Commit d260081ccf ("f2fs: change recovery policy of xattr node block")
removes the use of blkaddr, which is no longer used. So remove the
parameter.
Signed-off-by: Sheng Yong <shengyong1@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>