If there's not enough free sections each of which consistis of large segments,
we can hit no free section for upcoming section allocation. Let's reclaim some
prefree segments by writing checkpoints.
Signed-off-by: Byungki Lee <dominicus79@gmail.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
As Yanming reported in bugzilla:
https://bugzilla.kernel.org/show_bug.cgi?id=215904
The kernel message is shown below:
kernel BUG at fs/f2fs/inode.c:825!
Call Trace:
evict+0x282/0x4e0
__dentry_kill+0x2b2/0x4d0
shrink_dentry_list+0x17c/0x4f0
shrink_dcache_parent+0x143/0x1e0
do_one_tree+0x9/0x30
shrink_dcache_for_umount+0x51/0x120
generic_shutdown_super+0x5c/0x3a0
kill_block_super+0x90/0xd0
kill_f2fs_super+0x225/0x310
deactivate_locked_super+0x78/0xc0
cleanup_mnt+0x2b7/0x480
task_work_run+0xc8/0x150
exit_to_user_mode_prepare+0x14a/0x150
syscall_exit_to_user_mode+0x1d/0x40
do_syscall_64+0x48/0x90
The root cause is: inode node and dnode node share the same nid,
so during f2fs_evict_inode(), dnode node truncation will invalidate
its NAT entry, so when truncating inode node, it fails due to
invalid NAT entry, result in inode is still marked as dirty, fix
this issue by clearing dirty for inode and setting SBI_NEED_FSCK
flag in filesystem.
output from dump.f2fs:
[print_node_info: 354] Node ID [0xf:15] is inode
i_nid[0] [0x f : 15]
Cc: stable@vger.kernel.org
Reported-by: Ming Yan <yanming@tju.edu.cn>
Signed-off-by: Chao Yu <chao.yu@oppo.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
F2FS zoned support has power of 2 zone size assumption in many places
such as in __f2fs_issue_discard_zone, init_blkz_info. As the power of 2
requirement has been removed from the block layer, explicitly add a
condition in f2fs to allow only power of 2 zone size devices.
This condition will be relaxed once those calculation based on power of
2 is made generic.
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Instead of calling bdev_zone_sectors() multiple times, call
it once and cache the value locally. This will make the
subsequent change easier to read.
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
There are multiple calculations and reads of fields of sbi that should
be protected by stat_lock. As stat_lock is not used to read these
values in statfs, this can lead to inconsistent results.
Extend the locking to prevent this issue.
Commit c9c8ed50d9 ("f2fs: fix to avoid potential race on
sbi->unusable_block_count access/update")
already added the use of sbi->stat_lock in statfs in
order to make the calculation of multiple, different fields atomic so
that results are consistent. This is similar to that patch regarding the
change in statfs.
Signed-off-by: Niels Dossche <dossche.niels@gmail.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The block layer for zoned disk can reorder the FUA'ed IOs. Let's use flush
command to keep the write order.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Syzbot triggers two WARNs in f2fs_is_valid_blkaddr and
__is_bitmap_valid. For example, in f2fs_is_valid_blkaddr,
if type is DATA_GENERIC_ENHANCE or DATA_GENERIC_ENHANCE_READ,
it invokes WARN_ON if blkaddr is not in the right range.
The call trace is as follows:
f2fs_get_node_info+0x45f/0x1070
read_node_page+0x577/0x1190
__get_node_page.part.0+0x9e/0x10e0
__get_node_page
f2fs_get_node_page+0x109/0x180
do_read_inode
f2fs_iget+0x2a5/0x58b0
f2fs_fill_super+0x3b39/0x7ca0
Fix these two WARNs by replacing WARN_ON with dump_stack.
Reported-by: syzbot+763ae12a2ede1d99d4dc@syzkaller.appspotmail.com
Signed-off-by: Dongliang Mu <mudongliangabcd@gmail.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
To move the list iterator variable into the list_for_each_entry_*()
macro in the future it should be avoided to use the list iterator
variable after the loop body.
To *never* use the list iterator variable after the loop it was
concluded to use a separate iterator variable instead of a
found boolean [1].
This removes the need to use a found variable and simply checking if
the variable was set, can determine if the break/goto was hit.
Link: https://lore.kernel.org/all/CAHk-=wgRr_D8CB-D9Kg-c=EHreAsk5SqXPwr9Y7k9sA6cWXJ6w@mail.gmail.com/
Signed-off-by: Jakob Koschel <jakobkoschel@gmail.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In preparation to limit the scope of a list iterator to the list
traversal loop, the usage of the list iterator variable 'next' should
be avoided past the loop body [1].
Instead of calling list_move_tail() on 'next' after the loop, it is
called within the loop if the correct location was found.
After the loop it covers the case if no location was found and it
should be inserted based on the 'head' of the list.
Link: https://lore.kernel.org/all/CAHk-=wgRr_D8CB-D9Kg-c=EHreAsk5SqXPwr9Y7k9sA6cWXJ6w@mail.gmail.com/
Signed-off-by: Jakob Koschel <jakobkoschel@gmail.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The list iterator variable will be a bogus pointer if no break was hit.
Dereferencing it (cur->page in this case) could load an out-of-bounds/undefined
value making it unsafe to use that in the comparision to determine if the
specific element was found.
Since 'cur->page' *can* be out-ouf-bounds it cannot be guaranteed that
by chance (or intention of an attacker) it matches the value of 'page'
even though the correct element was not found.
This is fixed by using a separate list iterator variable for the loop
and only setting the original variable if a suitable element was found.
Then determing if the element was found is simply checking if the
variable is set.
Fixes: 8c242db9b8 ("f2fs: fix stale ATOMIC_WRITTEN_PAGE private pointer")
Signed-off-by: Jakob Koschel <jakobkoschel@gmail.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
As Wenqing reported in bugzilla:
https://bugzilla.kernel.org/show_bug.cgi?id=215765
It will cause a kernel panic with steps:
- mkdir mnt
- mount tmp40.img mnt
- ls mnt
folio_mark_dirty+0x33/0x50
f2fs_add_regular_entry+0x541/0xad0 [f2fs]
f2fs_add_dentry+0x6c/0xb0 [f2fs]
f2fs_do_add_link+0x182/0x230 [f2fs]
__recover_dot_dentries+0x2d6/0x470 [f2fs]
f2fs_lookup+0x5af/0x6a0 [f2fs]
__lookup_slow+0xac/0x200
lookup_slow+0x45/0x70
walk_component+0x16c/0x250
path_lookupat+0x8b/0x1f0
filename_lookup+0xef/0x250
user_path_at_empty+0x46/0x70
vfs_statx+0x98/0x190
__do_sys_newlstat+0x41/0x90
__x64_sys_newlstat+0x1a/0x30
do_syscall_64+0x37/0xb0
entry_SYSCALL_64_after_hwframe+0x44/0xae
The root cause is for special file: e.g. character, block, fifo or
socket file, f2fs doesn't assign address space operations pointer array
for mapping->a_ops field, so, in a fuzzed image, if inline_dots flag was
tagged in special file, during lookup(), when f2fs runs into
__recover_dot_dentries(), it will cause NULL pointer access once
f2fs_add_regular_entry() calls a_ops->set_dirty_page().
Fixes: 510022a858 ("f2fs: add F2FS_INLINE_DOTS to recover missing dot dentries")
Reported-by: Wenqing Liu <wenqingliu0120@gmail.com>
Signed-off-by: Chao Yu <chao.yu@oppo.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This can be removed, since f2fs_alloc_nid() actually doesn't require to block
checkpoint and __f2fs_build_free_nids() is covered by nm_i->nat_tree_lock.
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
So that it can reduce the possibility that file be unpinned forcely by
foreground GC due to .i_gc_failures[GC_FAILURE_PIN] exceeds threshold.
Signed-off-by: Chao Yu <chao.yu@oppo.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In order to skip migrating section which contains data of pinned
file in advance.
Signed-off-by: Chao Yu <chao.yu@oppo.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If the file preallocated blocks and fsync'ed, we should not truncate them during
roll-forward recovery which will recover i_size correctly back.
Fixes: d4dd19ec1e ("f2fs: do not expose unwritten blocks to user by DIO")
Cc: <stable@vger.kernel.org> # 5.17+
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch fixes wrong initialization.
Fixes: 50c63009f6 ("f2fs: avoid an infinite loop in f2fs_sync_dirty_inodes")
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Let's attach io_flags to bio only, so that we can merge IOs given original
io_flags only.
Fixes: 64bf0eef01 ("f2fs: pass the bio operation to bio_alloc_bioset")
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Add the 'zone_cap_mb' kernel module parameter. This parameter defines the
zone capacity. The zone capacity must be less than or equal to the zone
size.
Report that sequential write zones and gap zones are paired in the Zoned
Block Device Characteristics VPD page (page B6h).
This patch has been tested as follows:
modprobe scsi_debug delay=0 sector_size=512 dev_size_mb=128 zbc=host-managed zone_nr_conv=16 zone_size_mb=4 zone_cap_mb=3
modprobe brd rd_nr=1 rd_size=$((1<<20))
mkfs.f2fs -m /dev/ram0 -c /dev/${scsi_debug_dev}
mount /dev/ram0 /mnt
# Run a fio job that uses /mnt
Link: https://lore.kernel.org/r/20220421183023.3462291-10-bvanassche@acm.org
Cc: Douglas Gilbert <dgilbert@interlog.com>
Acked-by: Douglas Gilbert <dgilbert@interlog.com>
Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
[ bvanassche: Switched to reporting a constant zone starting LBA granularity ]
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Bug: 230616396
(cherry picked from commit 4a5fc1c6d7 mkp-scsi/staging)
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Change-Id: Ie8010a82a0070123cd56fcd0bf70ffa81d397d66
ZBC-2 allows host-managed disks to report gap zones. This allow zoned disks
to report an offset between data zone starts that is a power of two even if
the number of logical blocks with data per zone is not a power of two.
Another new feature in ZBC-2 is support for constant zone starting LBA
offsets. For zoned disks that report a constant zone starting LBA offset,
hide the gap zones from the block layer. Report the offset between data
zone starts as zone size and report the number of logical blocks with data
per zone as the zone capacity.
Link: https://lore.kernel.org/r/20220421183023.3462291-7-bvanassche@acm.org
Acked-by: Douglas Gilbert <dgilbert@interlog.com>
Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
[ bvanassche: Reworked this patch ]
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Bug: 230616396
(cherry picked from commit c976e588b3 mkp-scsi/staging)
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Change-Id: Iae809a10943a805d21bd4fdffc20ee9d6960d6e4
Encrypted files traditionally haven't supported DIO, due to the need to
encrypt/decrypt the data. However, when the encryption is implemented
using inline encryption (blk-crypto) instead of the traditional
filesystem-layer encryption, it is straightforward to support DIO.
Therefore, make f2fs support DIO on files that are using inline
encryption. Since f2fs uses iomap for DIO, and fscrypt support was
already added to iomap DIO, this just requires two small changes:
- Let DIO proceed when supported, by checking fscrypt_dio_supported()
instead of assuming that encrypted files never support DIO.
- In f2fs_iomap_begin(), use fscrypt_limit_io_blocks() to limit the
length of the mapping in the rare case where a DUN discontiguity
occurs in the middle of an extent. The iomap DIO implementation
requires this, since it assumes that it can submit a bio covering (up
to) the whole mapping, without checking fscrypt constraints itself.
Co-developed-by: Satya Tangirala <satyat@google.com>
Signed-off-by: Satya Tangirala <satyat@google.com>
Acked-by: Jaegeuk Kim <jaegeuk@kernel.org>
Link: https://lore.kernel.org/r/20220128233940.79464-5-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
Encrypted files traditionally haven't supported DIO, due to the need to
encrypt/decrypt the data. However, when the encryption is implemented
using inline encryption (blk-crypto) instead of the traditional
filesystem-layer encryption, it is straightforward to support DIO.
Therefore, make ext4 support DIO on files that are using inline
encryption. Since ext4 uses iomap for DIO, and fscrypt support was
already added to iomap DIO, this just requires two small changes:
- Let DIO proceed when supported, by checking fscrypt_dio_supported()
instead of assuming that encrypted files never support DIO.
- In ext4_iomap_begin(), use fscrypt_limit_io_blocks() to limit the
length of the mapping in the rare case where a DUN discontiguity
occurs in the middle of an extent. The iomap DIO implementation
requires this, since it assumes that it can submit a bio covering (up
to) the whole mapping, without checking fscrypt constraints itself.
Co-developed-by: Satya Tangirala <satyat@google.com>
Signed-off-by: Satya Tangirala <satyat@google.com>
Acked-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jaegeuk Kim <jaegeuk@kernel.org>
Link: https://lore.kernel.org/r/20220128233940.79464-4-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
Encrypted files traditionally haven't supported DIO, due to the need to
encrypt/decrypt the data. However, when the encryption is implemented
using inline encryption (blk-crypto) instead of the traditional
filesystem-layer encryption, it is straightforward to support DIO.
Add support for this to the iomap DIO implementation by calling
fscrypt_set_bio_crypt_ctx() to set encryption contexts on the bios.
Don't check for the rare case where a DUN (crypto data unit number)
discontiguity creates a boundary that bios must not cross. Instead,
filesystems are expected to handle this in ->iomap_begin() by limiting
the length of the mapping so that iomap doesn't have to worry about it.
Co-developed-by: Satya Tangirala <satyat@google.com>
Signed-off-by: Satya Tangirala <satyat@google.com>
Acked-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220128233940.79464-3-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
Encrypted files traditionally haven't supported DIO, due to the need to
encrypt/decrypt the data. However, when the encryption is implemented
using inline encryption (blk-crypto) instead of the traditional
filesystem-layer encryption, it is straightforward to support DIO.
In preparation for supporting this, add the following functions:
- fscrypt_dio_supported() checks whether a DIO request is supported as
far as encryption is concerned. Encrypted files will only support DIO
when inline encryption is used and the I/O request is properly
aligned; this function checks these preconditions.
- fscrypt_limit_io_blocks() limits the length of a bio to avoid crossing
a place in the file that a bio with an encryption context cannot
cross due to a DUN discontiguity. This function is needed by
filesystems that use the iomap DIO implementation (which operates
directly on logical ranges, so it won't use fscrypt_mergeable_bio())
and that support FSCRYPT_POLICY_FLAG_IV_INO_LBLK_32.
Co-developed-by: Satya Tangirala <satyat@google.com>
Signed-off-by: Satya Tangirala <satyat@google.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220128233940.79464-2-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
As bughunter reported in bugzilla:
https://bugzilla.kernel.org/show_bug.cgi?id=215709
f2fs may hang when mounting a fuzzed image, the dmesg shows as below:
__filemap_get_folio+0x3a9/0x590
pagecache_get_page+0x18/0x60
__get_meta_page+0x95/0x460 [f2fs]
get_checkpoint_version+0x2a/0x1e0 [f2fs]
validate_checkpoint+0x8e/0x2a0 [f2fs]
f2fs_get_valid_checkpoint+0xd0/0x620 [f2fs]
f2fs_fill_super+0xc01/0x1d40 [f2fs]
mount_bdev+0x18a/0x1c0
f2fs_mount+0x15/0x20 [f2fs]
legacy_get_tree+0x28/0x50
vfs_get_tree+0x27/0xc0
path_mount+0x480/0xaa0
do_mount+0x7c/0xa0
__x64_sys_mount+0x8b/0xe0
do_syscall_64+0x38/0xc0
entry_SYSCALL_64_after_hwframe+0x44/0xae
The root cause is cp_pack_total_block_count field in checkpoint was fuzzed
to one, as calcuated, two cp pack block locates in the same block address,
so then read latter cp pack block, it will block on the page lock due to
the lock has already held when reading previous cp pack block, fix it by
adding sanity check for cp_pack_total_block_count.
Cc: stable@vger.kernel.org
Signed-off-by: Chao Yu <chao.yu@oppo.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Let's enable GC_URGENT_HIGH mode during f2fs_disable_checkpoint(),
so that we can use SSR allocator for GCed data/node persistence,
it can improve the performance due to it avoiding migration of
data/node locates in selected target segment of SSR allocator.
Signed-off-by: Chao Yu <chao.yu@oppo.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
When compressed file has blocks, f2fs_ioc_start_atomic_write will succeed,
but compressed flag will be remained in inode. If write partial compreseed
cluster and commit atomic write will cause data corruption.
This is the reproduction process:
Step 1:
create a compressed file ,write 64K data , call fsync(), then the blocks
are write as compressed cluster.
Step2:
iotcl(F2FS_IOC_START_ATOMIC_WRITE) --- this should be fail, but not.
write page 0 and page 3.
iotcl(F2FS_IOC_COMMIT_ATOMIC_WRITE) -- page 0 and 3 write as normal file,
Step3:
drop cache.
read page 0-4 -- Since page 0 has a valid block address, read as
non-compressed cluster, page 1 and 2 will be filled with compressed data
or zero.
The root cause is, after commit 7eab7a6968 ("f2fs: compress: remove
unneeded read when rewrite whole cluster"), in step 2, f2fs_write_begin()
only set target page dirty, and in f2fs_commit_inmem_pages(), we will write
partial raw pages into compressed cluster, result in corrupting compressed
cluster layout.
Fixes: 4c8ff7095b ("f2fs: support data compression")
Fixes: 7eab7a6968 ("f2fs: compress: remove unneeded read when rewrite whole cluster")
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Fengnan Chang <changfengnan@vivo.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
It needs to initialized sbi->gc_mode to GC_NORMAL explicitly.
Signed-off-by: Chao Yu <chao.yu@oppo.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
We need a mid level of gc urgent mode to do GC forcibly in a period
of given gc_urgent_sleep_time, but not like using greedy GC approach
and switching to SSR mode such as gc urgent high mode. This can be
used for more aggressive periodic storage clean up.
Signed-off-by: Daeho Jeong <daehojeong@google.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In lz4_decompress_pages(), if size of decompressed data is not equal to
expected one, we should print the size rather than size of target buffer
for decompressed data, fix it.
Signed-off-by: Chao Yu <chao.yu@oppo.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
iput() has already judged the incoming parameter, so there is
no need to repeat the judgment here.
Signed-off-by: Wang Xiaojun <wangxiaojun11@huawei.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
F2FS_FITS_IN_INODE only cares the type of f2fs inode, so there
is no need to read node page of f2fs inode.
Signed-off-by: Jia Yang <jiayang5@huawei.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Unfair rwsem should be used when blk-cg is on. Otherwise, there is regression.
FYI, we noticed a -26.7% regression of aim7.jobs-per-min due to commit:
commit: e4544b63a7 ("f2fs: move f2fs to use reader-unfair rwsems")
https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master
in testcase: aim7
on test machine: 88 threads 2 sockets Intel(R) Xeon(R) Gold 6238M CPU @ 2.10GHz with 128G memory
with following parameters:
disk: 4BRD_12G
md: RAID0
fs: f2fs
test: sync_disk_rw
load: 100
cpufreq_governor: performance
ucode: 0x500320a
test-description: AIM7 is a traditional UNIX system level benchmark suite which is used to test and measure the performance of multiuser system.
test-url: https://sourceforge.net/projects/aimbench/files/aim-suite7/
Reported-by: kernel test robot <oliver.sang@intel.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>