Currently fscrypt_zeroout_range() issues and waits on a bio for each
block it writes, which makes it very slow.
Optimize it to write up to 16 pages at a time instead.
Also add a function comment, and improve reliability by allowing the
allocations of the bio and the first ciphertext page to wait on the
corresponding mempools.
Link: https://lore.kernel.org/r/20191226160813.53182-1-ebiggers@kernel.org
Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Eric Biggers <ebiggers@google.com>
The commit 643fa9612b ("fscrypt: remove filesystem specific
build config option") removed modular support for fs/crypto. This
causes the Crypto API to be built-in whenever fscrypt is enabled.
This makes it very difficult for me to test modular builds of
the Crypto API without disabling fscrypt which is a pain.
As fscrypt is still evolving and it's developing new ties with the
fs layer, it's hard to build it as a module for now.
However, the actual algorithms are not required until a filesystem
is mounted. Therefore we can allow them to be built as modules.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Link: https://lore.kernel.org/r/20191227024700.7vrzuux32uyfdgum@gondor.apana.org.au
Signed-off-by: Eric Biggers <ebiggers@google.com>
<linux/fscrypt.h> defines ioctl numbers using the macros like _IOWR()
which are defined in <linux/ioctl.h>, so <linux/ioctl.h> should be
included as a prerequisite, like it is in many other kernel headers.
In practice this doesn't really matter since anyone referencing these
ioctl numbers will almost certainly include <sys/ioctl.h> too in order
to actually call ioctl(). But we might as well fix this.
Link: https://lore.kernel.org/r/20191219185624.21251-1-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
fscrypt_valid_enc_modes() is only used by policy.c, so move it to there.
Also adjust the order of the checks to be more natural, matching the
numerical order of the constants and also keeping AES-256 (the
recommended default) first in the list.
No change in behavior.
Link: https://lore.kernel.org/r/20191209211829.239800-4-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
FSCRYPT_POLICY_FLAG_DIRECT_KEY is currently only allowed with Adiantum
encryption. But FS_IOC_SET_ENCRYPTION_POLICY allowed it in combination
with other encryption modes, and an error wasn't reported until later
when the encrypted directory was actually used.
Fix it to report the error earlier by validating the correct use of the
DIRECT_KEY flag in fscrypt_supported_policy(), similar to how we
validate the IV_INO_LBLK_64 flag.
Link: https://lore.kernel.org/r/20191209211829.239800-3-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
Make fscrypt_supported_policy() call new functions
fscrypt_supported_v1_policy() and fscrypt_supported_v2_policy(), to
reduce the indentation level and make the code easier to read.
Also adjust the function comment to mention that whether the encryption
policy is supported can also depend on the inode.
No change in behavior.
Link: https://lore.kernel.org/r/20191209211829.239800-2-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
Add a function fscrypt_needs_contents_encryption() which takes an inode
and returns true if it's an encrypted regular file and the kernel was
built with fscrypt support.
This will allow replacing duplicated checks of IS_ENCRYPTED() &&
S_ISREG() on the I/O paths in ext4 and f2fs, while also optimizing out
unneeded code when !CONFIG_FS_ENCRYPTION.
Link: https://lore.kernel.org/r/20191209205021.231767-1-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
Extend the FS_IOC_ADD_ENCRYPTION_KEY ioctl to allow the raw key to be
specified by a Linux keyring key, rather than specified directly.
This is useful because fscrypt keys belong to a particular filesystem
instance, so they are destroyed when that filesystem is unmounted.
Usually this is desired. But in some cases, userspace may need to
unmount and re-mount the filesystem while keeping the keys, e.g. during
a system update. This requires keeping the keys somewhere else too.
The keys could be kept in memory in a userspace daemon. But depending
on the security architecture and assumptions, it can be preferable to
keep them only in kernel memory, where they are unreadable by userspace.
We also can't solve this by going back to the original fscrypt API
(where for each file, the master key was looked up in the process's
keyring hierarchy) because that caused lots of problems of its own.
Therefore, add the ability for FS_IOC_ADD_ENCRYPTION_KEY to accept a
Linux keyring key. This solves the problem by allowing userspace to (if
needed) save the keys securely in a Linux keyring for re-provisioning,
while still using the new fscrypt key management ioctls.
This is analogous to how dm-crypt accepts a Linux keyring key, but the
key is then stored internally in the dm-crypt data structures rather
than being looked up again each time the dm-crypt device is accessed.
Use a custom key type "fscrypt-provisioning" rather than one of the
existing key types such as "logon". This is strongly desired because it
enforces that these keys are only usable for a particular purpose: for
fscrypt as input to a particular KDF. Otherwise, the keys could also be
passed to any kernel API that accepts a "logon" key with any service
prefix, e.g. dm-crypt, UBIFS, or (recently proposed) AF_ALG. This would
risk leaking information about the raw key despite it ostensibly being
unreadable. Of course, this mistake has already been made for multiple
kernel APIs; but since this is a new API, let's do it right.
This patch has been tested using an xfstest which I wrote to test it.
Link: https://lore.kernel.org/r/20191119222447.226853-1-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
Export lookup_user_key() symbol in order to allow nvdimm passphrase
update to retrieve user injected keys.
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Since ->d_compare() and ->d_hash() can be called in RCU-walk mode,
->d_parent and ->d_inode can be concurrently modified, and in
particular, ->d_inode may be changed to NULL. For f2fs_d_hash() this
resulted in a reproducible NULL dereference if a lookup is done in a
directory being deleted, e.g. with:
int main()
{
if (fork()) {
for (;;) {
mkdir("subdir", 0700);
rmdir("subdir");
}
} else {
for (;;)
access("subdir/file", 0);
}
}
... or by running the 't_encrypted_d_revalidate' program from xfstests.
Both repros work in any directory on a filesystem with the encoding
feature, even if the directory doesn't actually have the casefold flag.
I couldn't reproduce a crash in f2fs_d_compare(), but it appears that a
similar crash is possible there.
Fix these bugs by reading ->d_parent and ->d_inode using READ_ONCE() and
falling back to the case sensitive behavior if the inode is NULL.
Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Fixes: 2c2eb7a300 ("f2fs: Support case-insensitive file name lookups")
Cc: <stable@vger.kernel.org> # v5.4+
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Do the name comparison for non-casefolded directories correctly.
This is analogous to ext4's commit 66883da1ee ("ext4: fix dcache
lookup of !casefolded directories").
Fixes: 2c2eb7a300 ("f2fs: Support case-insensitive file name lookups")
Cc: <stable@vger.kernel.org> # v5.4+
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Currently f2fs stats are only available from /d/f2fs/status. This patch
adds some of the f2fs stats to sysfs so that they are accessible even
when debugfs is not mounted.
The following sysfs nodes are added:
-/sys/fs/f2fs/<disk>/free_segments
-/sys/fs/f2fs/<disk>/cp_foreground_calls
-/sys/fs/f2fs/<disk>/cp_background_calls
-/sys/fs/f2fs/<disk>/gc_foreground_calls
-/sys/fs/f2fs/<disk>/gc_background_calls
-/sys/fs/f2fs/<disk>/moved_blocks_foreground
-/sys/fs/f2fs/<disk>/moved_blocks_background
-/sys/fs/f2fs/<disk>/avg_vblocks
Signed-off-by: Hridya Valsaraju <hridya@google.com>
[Jaegeuk Kim: allow STAT_FS without DEBUG_FS]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch merges the sysfs node documentation present in
Documentation/filesystems/f2fs.txt and
Documentation/ABI/testing/sysfs-fs-f2fs
and deletes the duplicate information from
Documentation/filesystems/f2fs.txt. This is to prevent having to update
both files when a new sysfs node is added for f2fs.
The patch also makes minor formatting changes to
Documentation/ABI/testing/sysfs-fs-f2fs.
Signed-off-by: Hridya Valsaraju <hridya@google.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Mutex lock won't serialize callers, in order to avoid starving of unlucky
caller, let's use rwsem lock instead.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Setting 0x40 in /sys/fs/f2fs/dev/ipu_policy gives a way to turn off
bio cache, which is useufl to check whether block layer using hardware
encryption engine merges IOs correctly.
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Remove the duplicate CP_UMOUNT enum and add the new CP_PAUSE
enum to show the checkpoint reason in the trace prints.
Signed-off-by: Sahitya Tummala <stummala@codeaurora.org>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Without any form of coordination, any case where multiple allocations
from the same mempool are needed at a time to make forward progress can
deadlock under memory pressure.
This is the case for struct bio_post_read_ctx, as one can be allocated
to decrypt a Merkle tree page during fsverity_verify_bio(), which itself
is running from a post-read callback for a data bio which has its own
struct bio_post_read_ctx.
Fix this by freeing first bio_post_read_ctx before calling
fsverity_verify_bio(). This works because verity (if enabled) is always
the last post-read step.
This deadlock can be reproduced by trying to read from an encrypted
verity file after reducing NUM_PREALLOC_POST_READ_CTXS to 1 and patching
mempool_alloc() to pretend that pool->alloc() always fails.
Note that since NUM_PREALLOC_POST_READ_CTXS is actually 128, to actually
hit this bug in practice would require reading from lots of encrypted
verity files at the same time. But it's theoretically possible, as N
available objects doesn't guarantee forward progress when > N/2 threads
each need 2 objects at a time.
Fixes: 95ae251fe8 ("f2fs: add fs-verity support")
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Since allocating an object from a mempool never fails when
__GFP_DIRECT_RECLAIM (which is included in GFP_NOFS) is set, the check
for failure to allocate a bio_post_read_ctx is unnecessary. Remove it.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If we hit an error during rename, we'll get two dentries in different
directories.
Chao adds to check the room in inline_dir which can avoid needless
inversion. This should be done by inode_lock(&old_dir).
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If kobject_init_and_add() failed, caller needs to invoke kobject_put()
to release kobject explicitly.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
As Youling reported in mailing list:
https://www.linuxquestions.org/questions/linux-newbie-8/the-file-system-f2fs-is-broken-4175666043/https://www.linux.org/threads/the-file-system-f2fs-is-broken.26490/
There is a test case can corrupt f2fs image:
- dd if=/dev/zero of=/swapfile bs=1M count=4096
- chmod 600 /swapfile
- mkswap /swapfile
- swapon --discard /swapfile
The root cause is f2fs_swap_activate() intends to return zero value
to setup_swap_extents() to enable SWP_FS mode (swap file goes through
fs), in this flow, setup_swap_extents() setups swap extent with wrong
block address range, result in discard_swap() erasing incorrect address.
Because f2fs_swap_activate() has pinned swapfile, its data block
address will not change, it's safe to let swap to handle IO through
raw device, so we can get rid of SWAP_FS mode and initial swap extents
inside f2fs_swap_activate(), by this way, later discard_swap() can trim
in right address range.
Fixes: 4969c06a0d ("f2fs: support swap file w/ DIO")
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch tries to support compression in f2fs.
- New term named cluster is defined as basic unit of compression, file can
be divided into multiple clusters logically. One cluster includes 4 << n
(n >= 0) logical pages, compression size is also cluster size, each of
cluster can be compressed or not.
- In cluster metadata layout, one special flag is used to indicate cluster
is compressed one or normal one, for compressed cluster, following metadata
maps cluster to [1, 4 << n - 1] physical blocks, in where f2fs stores
data including compress header and compressed data.
- In order to eliminate write amplification during overwrite, F2FS only
support compression on write-once file, data can be compressed only when
all logical blocks in file are valid and cluster compress ratio is lower
than specified threshold.
- To enable compression on regular inode, there are three ways:
* chattr +c file
* chattr +c dir; touch dir/file
* mount w/ -o compress_extension=ext; touch file.ext
Compress metadata layout:
[Dnode Structure]
+-----------------------------------------------+
| cluster 1 | cluster 2 | ......... | cluster N |
+-----------------------------------------------+
. . . .
. . . .
. Compressed Cluster . . Normal Cluster .
+----------+---------+---------+---------+ +---------+---------+---------+---------+
|compr flag| block 1 | block 2 | block 3 | | block 1 | block 2 | block 3 | block 4 |
+----------+---------+---------+---------+ +---------+---------+---------+---------+
. .
. .
. .
+-------------+-------------+----------+----------------------------+
| data length | data chksum | reserved | compressed data |
+-------------+-------------+----------+----------------------------+
Changelog:
20190326:
- fix error handling of read_end_io().
- remove unneeded comments in f2fs_encrypt_one_page().
20190327:
- fix wrong use of f2fs_cluster_is_full() in f2fs_mpage_readpages().
- don't jump into loop directly to avoid uninitialized variables.
- add TODO tag in error path of f2fs_write_cache_pages().
20190328:
- fix wrong merge condition in f2fs_read_multi_pages().
- check compressed file in f2fs_post_read_required().
20190401
- allow overwrite on non-compressed cluster.
- check cluster meta before writing compressed data.
20190402
- don't preallocate blocks for compressed file.
- add lz4 compress algorithm
- process multiple post read works in one workqueue
Now f2fs supports processing post read work in multiple workqueue,
it shows low performance due to schedule overhead of multiple
workqueue executing orderly.
20190921
- compress: support buffered overwrite
C: compress cluster flag
V: valid block address
N: NEW_ADDR
One cluster contain 4 blocks
before overwrite after overwrite
- VVVV -> CVNN
- CVNN -> VVVV
- CVNN -> CVNN
- CVNN -> CVVV
- CVVV -> CVNN
- CVVV -> CVVV
20191029
- add kconfig F2FS_FS_COMPRESSION to isolate compression related
codes, add kconfig F2FS_FS_{LZO,LZ4} to cover backend algorithm.
note that: will remove lzo backend if Jaegeuk agreed that too.
- update codes according to Eric's comments.
20191101
- apply fixes from Jaegeuk
20191113
- apply fixes from Jaegeuk
- split workqueue for fsverity
20191216
- apply fixes from Jaegeuk
[Jaegeuk Kim]
- add tracepoint for f2fs_{,de}compress_pages()
- fix many bugs and add some compression stats
- fix overwrite/mmap bugs
- address 32bit build error, reported by Geert.
- bug fixes when handling errors and i_compressed_blocks
Reported-by: <noreply@ellerman.id.au>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In f2fs_rename(), new_page is gone after f2fs_set_link(), but it tries
to put again when whiteout is failed and jumped to put_out_dir.
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
META_MAPPING is used to move blocks for both encrypted and verity files.
So the META_MAPPING invalidation condition in do_checkpoint() should
consider verity too, not just encrypt.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In low memory scenario, we can allocate multiple bios without
submitting any of them.
- f2fs_write_checkpoint()
- block_operations()
- f2fs_sync_node_pages()
step 1) flush cold nodes, allocate new bio from mempool
- bio_alloc()
- mempool_alloc()
step 2) flush hot nodes, allocate a bio from mempool
- bio_alloc()
- mempool_alloc()
step 3) flush warm nodes, be stuck in below call path
- bio_alloc()
- mempool_alloc()
- loop to wait mempool element release, as we only
reserved memory for two bio allocation, however above
allocated two bios may never be submitted.
So we need avoid using default bioset, in this patch we introduce a
private bioset, in where we enlarg mempool element count to total
number of log header, so that we can make sure we have enough
backuped memory pool in scenario of allocating/holding multiple
bios.
Signed-off-by: Gao Xiang <gaoxiang25@huawei.com>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Remove duplicate sbi->aw_cnt stats counter that tracks
the number of atomic files currently opened (it also shows
incorrect value sometimes). Use more relit lable sbi->atomic_files
to show in the stats.
Signed-off-by: Sahitya Tummala <stummala@codeaurora.org>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Otherwise, it can cause circular locking dependency reported by mm.
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch avoids some unnecessary locks for quota files when write_begin
fails.
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Otherwise, we can hit deadlock by waiting for the locked page in
move_data_block in GC.
Thread A Thread B
- do_page_mkwrite
- f2fs_vm_page_mkwrite
- lock_page
- f2fs_balance_fs
- mutex_lock(gc_mutex)
- f2fs_gc
- do_garbage_collect
- ra_data_block
- grab_cache_page
- f2fs_balance_fs
- mutex_lock(gc_mutex)
Fixes: 39a8695824 ("f2fs: refactor ->page_mkwrite() flow")
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The previous preallocation and DIO decision like below.
allow_outplace_dio !allow_outplace_dio
f2fs_force_buffered_io (*) No_Prealloc / Buffered_IO Prealloc / Buffered_IO
!f2fs_force_buffered_io No_Prealloc / DIO Prealloc / DIO
But, Javier reported Case (*) where zoned device bypassed preallocation but
fell back to buffered writes in f2fs_direct_IO(), resulting in stale data
being read.
In order to fix the issue, actually we need to preallocate blocks whenever
we fall back to buffered IO like this. No change is made in the other cases.
allow_outplace_dio !allow_outplace_dio
f2fs_force_buffered_io (*) Prealloc / Buffered_IO Prealloc / Buffered_IO
!f2fs_force_buffered_io No_Prealloc / DIO Prealloc / DIO
Reported-and-tested-by: Javier Gonzalez <javier@javigon.com>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Reviewed-by: Javier González <javier@javigon.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Set the STATX_ATTR_VERITY bit when the statx() system call is used on a
verity file on ext4.
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Eric Biggers <ebiggers@google.com>