When using f2fs on a zoned block device with 2MiB zone size, IO errors
occurs because f2fs tries to write data to a zone that has not been reset.
The cause is that f2fs tries to discard multiple zones at once. This is
caused by a condition in f2fs_clear_prefree_segments that does not check
for zoned block devices when setting the discard range. This leads to
invalid reset commands and write pointer mismatches.
This patch fixes the zoned block device with 2MiB zone size to reset one
zone at a time.
Signed-off-by: Yonggil Song <yonggil.song@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This is a last part to remove the memory sharing for rb_tree in extent_cache.
This should also fix arm32 memory alignment issue.
[struct extent_node] [struct rb_entry]
[0] struct rb_node rb_node; [0] struct rb_node rb_node;
union { union {
struct { struct {
[16] unsigned int fofs; [12] unsigned int ofs;
unsigned int len; unsigned int len;
};
unsigned long long key;
} __packed;
Cc: <stable@vger.kernel.org>
Fixes: 13054c548a ("f2fs: introduce infra macro and data structure of rb-tree extent cache")
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This is a second part to remove the mixed use of rb_tree in discard_cmd from
extent_cache.
This should also fix arm32 memory alignment issue caused by shared rb_entry.
[struct discard_cmd] [struct rb_entry]
[0] struct rb_node rb_node; [0] struct rb_node rb_node;
union { union {
struct { struct {
[16] block_t lstart; [12] unsigned int ofs;
block_t len; unsigned int len;
};
unsigned long long key;
} __packed;
Cc: <stable@vger.kernel.org>
Fixes: 004b686218 ("f2fs: use rb-tree to track pending discard commands")
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Let's reduce the complexity of mixed use of rb_tree in victim_entry from
extent_cache and discard_cmd.
This should fix arm32 memory alignment issue caused by shared rb_entry.
[struct victim_entry] [struct rb_entry]
[0] struct rb_node rb_node; [0] struct rb_node rb_node;
union {
struct {
unsigned int ofs;
unsigned int len;
};
[16] unsigned long long mtime; [12] unsigned long long key;
} __packed;
Cc: <stable@vger.kernel.org>
Fixes: 093749e296 ("f2fs: support age threshold based garbage collection")
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
When f2fs skipped a gc round during victim migration, there was a bug which
would skip all upcoming gc rounds unconditionally because skipped_gc_rwsem
was not initialized. It fixes the bug by correctly initializing the
skipped_gc_rwsem inside the gc loop.
Fixes: 6f8d445506 ("f2fs: avoid fi->i_gc_rwsem[WRITE] lock in f2fs_gc")
Signed-off-by: Yonggil Song <yonggil.song@samsung.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
We should set the error code when dqget() failed.
Fixes: 2c1d030569 ("f2fs: support F2FS_IOC_FS{GET,SET}XATTR")
Signed-off-by: Yangtao Li <frank.li@vivo.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
After commit 26b5a07919 ("f2fs: cleanup dirty pages if recover failed"),
f2fs_sync_inode_meta() is only used in checkpoint.c, so
f2fs_sync_inode_meta() should only be visible inside. Delete the
declaration in the header file and change f2fs_sync_inode_meta()
to static.
Signed-off-by: Li Zetao <lizetao1@huawei.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Fix the nid_t field so that its size is correctly reported in the text
format embedded in trace.dat files. As it stands, it is reported as
being of size 4:
field:nid_t nid[3]; offset:24; size:4; signed:0;
Instead of 12:
field:nid_t nid[3]; offset:24; size:12; signed:0;
This also fixes the reported offset of subsequent fields so that they
match with the actual struct layout.
Signed-off-by: Douglas Raillard <douglas.raillard@arm.com>
Reviewed-by: Mukesh Ojha <quic_mojha@quicinc.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Try to make the filesystem-level decryption functions in fs/crypto/
aware of large folios. This includes making fscrypt_decrypt_bio()
support the case where the bio contains large folios, and making
fscrypt_decrypt_pagecache_blocks() take a folio instead of a page.
There's no way to actually test this with large folios yet, but I've
tested that this doesn't cause any regressions.
Note that this patch just handles *decryption*, not encryption which
will be a little more difficult.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Link: https://lore.kernel.org/r/20230127224202.355629-1-ebiggers@kernel.org
Try to make fs/verity/verify.c aware of large folios. This includes
making fsverity_verify_bio() support the case where the bio contains
large folios, and adding a function fsverity_verify_folio() which is the
equivalent of fsverity_verify_page().
There's no way to actually test this with large folios yet, but I've
tested that this doesn't cause any regressions.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Link: https://lore.kernel.org/r/20230127221529.299560-1-ebiggers@kernel.org
After each filesystem block (as represented by a buffer_head) has been
read from disk by block_read_full_folio(), verify it if needed. The
verification is done on the fsverity_read_workqueue. Also allow reads
of verity metadata past i_size, as required by ext4.
This is needed to support fsverity on ext4 filesystems where the
filesystem block size is less than the page size.
The new code is compiled away when CONFIG_FS_VERITY=n.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Tested-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Link: https://lore.kernel.org/r/20221223203638.41293-11-ebiggers@kernel.org
Make FS_IOC_ENABLE_VERITY support values of
fsverity_enable_arg::block_size other than PAGE_SIZE.
To make this possible, rework build_merkle_tree(), which was reading
data and hash pages from the file and assuming that they were the same
thing as "blocks".
For reading the data blocks, just replace the direct pagecache access
with __kernel_read(), to naturally read one block at a time.
(A disadvantage of the above is that we lose the two optimizations of
hashing the pagecache pages in-place and forcing the maximum readahead.
That shouldn't be very important, though.)
The hash block reads are a bit more difficult to handle, as the only way
to do them is through fsverity_operations::read_merkle_tree_page().
Instead, let's switch to the single-pass tree construction algorithm
that fsverity-utils uses. This eliminates the need to read back any
hash blocks while the tree is being built, at the small cost of an extra
block-sized memory buffer per Merkle tree level. This is probably what
I should have done originally.
Taken together, the above two changes result in page-size independent
code that is also a bit simpler than what we had before.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Andrey Albershteyn <aalbersh@redhat.com>
Tested-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Link: https://lore.kernel.org/r/20221223203638.41293-8-ebiggers@kernel.org
Add support for verifying data from verity files whose Merkle tree block
size is less than the page size. The main use case for this is to allow
a single Merkle tree block size to be used across all systems, so that
only one set of fsverity file digests and signatures is needed.
To do this, eliminate various assumptions that the Merkle tree block
size and the page size are the same:
- Make fsverity_verify_page() a wrapper around a new function
fsverity_verify_blocks() which verifies one or more blocks in a page.
- When a Merkle tree block is needed, get the corresponding page and
only verify and use the needed portion. (The Merkle tree continues to
be read and cached in page-sized chunks; that doesn't need to change.)
- When the Merkle tree block size and page size differ, use a bitmap
fsverity_info::hash_block_verified to keep track of which Merkle tree
blocks have been verified, as PageChecked cannot be used directly.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Andrey Albershteyn <aalbersh@redhat.com>
Tested-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Link: https://lore.kernel.org/r/20221223203638.41293-7-ebiggers@kernel.org
In preparation for allowing the Merkle tree block size to differ from
PAGE_SIZE, replace fsverity_hash_page() with fsverity_hash_block(). The
new function is similar to the old one, but it operates on the block at
the given offset in the page instead of on the full page.
(For now, all callers still pass a full page.)
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Andrey Albershteyn <aalbersh@redhat.com>
Tested-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Link: https://lore.kernel.org/r/20221223203638.41293-6-ebiggers@kernel.org
Currently, there is an implementation limit where files can't have more
than 8 Merkle tree levels. With SHA-256 and 4K blocks, this limit is
never reached, since a file would need to be larger than 2**64 bytes to
need 9 levels. However, with SHA-512, 9 levels are needed for files
larger than about 1.15 EB, which is possible on btrfs. Therefore, this
limit technically became reachable when btrfs added fsverity support.
Meanwhile, support for merkle_tree_block_size < PAGE_SIZE will introduce
another implementation limit on file size, resulting from the use of an
in-memory bitmap to track which Merkle tree blocks have been verified.
In any case, currently FS_IOC_ENABLE_VERITY fails with EINVAL when the
file is too large. This is undocumented, and also ambiguous since
EINVAL can mean other things too. Let's change the error code to EFBIG,
which is much clearer, and document it.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Andrey Albershteyn <aalbersh@redhat.com>
Tested-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Link: https://lore.kernel.org/r/20221223203638.41293-5-ebiggers@kernel.org
First, calculate max_ra_pages more efficiently by using the bio size.
Second, calculate the number of readahead pages from the hash page
index, instead of calculating it ahead of time using the data page
index. This ends up being a bit simpler, especially since level 0 is
last in the tree, so we can just limit the readahead to the tree size.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Andrey Albershteyn <aalbersh@redhat.com>
Tested-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Link: https://lore.kernel.org/r/20221223203638.41293-3-ebiggers@kernel.org
fs/verity/ isn't consistent with whether Merkle tree block indices are
'unsigned long' or 'u64'. There's no real point to using u64 for them,
though, since (a) a Merkle tree with over ULONG_MAX blocks would only be
needed for a file larger than MAX_LFS_FILESIZE, and (b) for reads, the
status of all Merkle tree blocks has to be tracked in memory.
Therefore, let's make things a bit more efficient on 32-bit systems by
using 'unsigned long[]' for merkle_tree_params::level_start, instead of
'u64[]'. Also, to be extra safe, explicitly check that there aren't
more than ULONG_MAX Merkle tree blocks.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Andrey Albershteyn <aalbersh@redhat.com>
Tested-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Link: https://lore.kernel.org/r/20221223203638.41293-2-ebiggers@kernel.org
I've gotten very little use out of these debug messages, and I'm not
aware of anyone else having used them.
Indeed, sprinkling pr_debug around is not really a best practice these
days, especially for filesystem code. Tracepoints are used instead.
Let's just remove these and start from a clean slate.
This change does not affect info, warning, and error messages.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Link: https://lore.kernel.org/r/20221215060420.60692-1-ebiggers@kernel.org
fsverity_operations::write_merkle_tree_block is passed the index of the
block to write and the log base 2 of the block size. However, all
implementations of it use these parameters only to calculate the
position and the size of the block, in bytes.
Therefore, make ->write_merkle_tree_block take 'pos' and 'size'
parameters instead of 'index' and 'log_blocksize'.
Suggested-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Acked-by: Dave Chinner <dchinner@redhat.com>
Link: https://lore.kernel.org/r/20221214224304.145712-5-ebiggers@kernel.org
Now that fscrypt_add_test_dummy_key() is only called by
setup_file_encryption_key() and not by the individual filesystems,
un-export it. Also change its prototype to take the
fscrypt_key_specifier directly, as the caller already has it.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Link: https://lore.kernel.org/r/20230208062107.199831-6-ebiggers@kernel.org
Now that the key associated with the "test_dummy_operation" mount option
is added on-demand when it's needed, rather than immediately when the
filesystem is mounted, fscrypt_destroy_keyring() no longer needs to be
called from __put_super() to avoid a memory leak on mount failure.
Remove this call, which was causing confusion because it appeared to be
a sleep-in-atomic bug (though it wasn't, for a somewhat-subtle reason).
Signed-off-by: Eric Biggers <ebiggers@google.com>
Link: https://lore.kernel.org/r/20230208062107.199831-5-ebiggers@kernel.org
When the key for an inode is not found but the inode is using the
test_dummy_encryption policy, automatically add the
test_dummy_encryption key to the filesystem keyring. This eliminates
the need for all the individual filesystems to do this at mount time,
which is a bit tricky to clean up from on failure.
Note: this covers the call to fscrypt_find_master_key() from inode key
setup, but not from the fscrypt ioctls. So, this isn't *exactly* the
same as the key being present from the very beginning. I think we can
tolerate that, though, since the inode key setup caller is the only one
that actually matters in the context of test_dummy_encryption.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Link: https://lore.kernel.org/r/20230208062107.199831-2-ebiggers@kernel.org
We should not truncate replaced blocks, and were supposed to truncate the first
part as well.
This reverts commit 78a99fe625.
Cc: stable@vger.kernel.org
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
MAIN_SEGS is for data area, while TOTAL_SEGS includes data and metadata.
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
For each loop add a local f2fs_sb_info pointer insted of looking it up.
Signed-off-by: Yangtao Li <frank.li@vivo.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Export ipu_policy as a string in debugfs for better readability and
it can help us better understand some strategies of the file system.
Signed-off-by: Yangtao Li <frank.li@vivo.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Since commit ee6d3dd4ed ("driver core: make kobj_type constant.")
the driver core allows the usage of const struct kobj_type.
Take advantage of this to constify the structure definitions to prevent
modification at runtime.
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In do_read_inode(), sanity check for extent cache should be called after
f2fs_init_read_extent_tree(), fix it.
Fixes: 72840cccc0 ("f2fs: allocate the extent_cache by default")
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
IPU policy can be disabled, let's add description for it and other policy.
Signed-off-by: Yangtao Li <frank.li@vivo.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
For LFS mode, it should update outplace and no need inplace update.
When using LFS mode for small-volume devices, IPU will not be used,
and the OPU writing method is actually used, but F2FS_IPU_FORCE can
be read from the ipu_policy node, which is different from the actual
situation. And remount to lfs mode should be disallowed when
f2fs ipu is enabled, let's fix it.
Fixes: 84b89e5d94 ("f2fs: add auto tuning for small devices")
Signed-off-by: Yangtao Li <frank.li@vivo.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Convert to use iostat_lat_type as parameter instead of raw number.
BTW, move NUM_PREALLOC_IOSTAT_CTXS to the header file, adjust
iostat_lat[{0,1,2}] to iostat_lat[{READ_IO,WRITE_SYNC_IO,WRITE_ASYNC_IO}]
in tracepoint function, and rename iotype to page_type to match the definition.
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Yangtao Li <frank.li@vivo.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Commit 5911d2d1d1 ("f2fs: introduce gc_merge mount option") forgot
to show nogc_merge option, let's fix it.
Signed-off-by: Yangtao Li <frank.li@vivo.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>