Commit Graph

784127 Commits

Author SHA1 Message Date
Chao Yu
80b3dc7dd9 f2fs: fix to avoid accessing uninitialized field of inode page in is_alive()
If inode is newly created, inode page may not synchronize with inode cache,
so fields like .i_inline or .i_extra_isize could be wrong, in below call
path, we may access such wrong fields, result in failing to migrate valid
target block.

Thread A				Thread B
- f2fs_create
 - f2fs_add_link
  - f2fs_add_dentry
   - f2fs_init_inode_metadata
    - f2fs_add_inline_entry
     - f2fs_new_inode_page
     - f2fs_put_page
     : inode page wasn't updated with inode cache
					- gc_data_segment
					 - is_alive
					  - f2fs_get_node_page
					  - datablock_addr
					   - offset_in_addr
					   : access uninitialized fields

Fixes: 7a2af766af ("f2fs: enhance on-disk inode structure scalability")
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2019-09-23 14:12:03 -07:00
Jaegeuk Kim
b8fa479c71 f2fs: avoid infinite GC loop due to stale atomic files
If committing atomic pages is failed when doing f2fs_do_sync_file(), we can
get commited pages but atomic_file being still set like:

- inmem:    0, atomic IO:    4 (Max.   10), volatile IO:    0 (Max.    0)

If GC selects this block, we can get an infinite loop like this:

f2fs_submit_page_bio: dev = (253,7), ino = 2, page_index = 0x2359a8, oldaddr = 0x2359a8, newaddr = 0x2359a8, rw = READ(), type = COLD_DATA
f2fs_submit_read_bio: dev = (253,7)/(253,7), rw = READ(), DATA, sector = 18533696, size = 4096
f2fs_get_victim: dev = (253,7), type = No TYPE, policy = (Foreground GC, LFS-mode, Greedy), victim = 4355, cost = 1, ofs_unit = 1, pre_victim_secno = 4355, prefree = 0, free = 234
f2fs_iget: dev = (253,7), ino = 6247, pino = 5845, i_mode = 0x81b0, i_size = 319488, i_nlink = 1, i_blocks = 624, i_advise = 0x2c
f2fs_submit_page_bio: dev = (253,7), ino = 2, page_index = 0x2359a8, oldaddr = 0x2359a8, newaddr = 0x2359a8, rw = READ(), type = COLD_DATA
f2fs_submit_read_bio: dev = (253,7)/(253,7), rw = READ(), DATA, sector = 18533696, size = 4096
f2fs_get_victim: dev = (253,7), type = No TYPE, policy = (Foreground GC, LFS-mode, Greedy), victim = 4355, cost = 1, ofs_unit = 1, pre_victim_secno = 4355, prefree = 0, free = 234
f2fs_iget: dev = (253,7), ino = 6247, pino = 5845, i_mode = 0x81b0, i_size = 319488, i_nlink = 1, i_blocks = 624, i_advise = 0x2c

In that moment, we can observe:

[Before]
Try to move 5084219 blocks (BG: 384508)
  - data blocks : 4962373 (274483)
  - node blocks : 121846 (110025)
Skipped : atomic write 4534686 (10)

[After]
Try to move 5088973 blocks (BG: 384508)
  - data blocks : 4967127 (274483)
  - node blocks : 121846 (110025)
Skipped : atomic write 4539440 (10)

So, refactor atomic_write flow like this:
1. start_atomic_write
 - add inmem_list and set atomic_file

2. write()
 - register it in inmem_pages

3. commit_atomic_write
 - if no error, f2fs_drop_inmem_pages()
 - f2fs_commit_inmme_pages() failed
   : __revoked_inmem_pages() was done
 - f2fs_do_sync_file failed
   : abort_atomic_write later

4. abort_atomic_write
 - f2fs_drop_inmem_pages

5. f2fs_drop_inmem_pages
 - clear atomic_file
 - remove inmem_list

Based on this change, when GC fails to move block in atomic_file,
f2fs_drop_inmem_pages_all() can call f2fs_drop_inmem_pages().

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2019-09-23 14:12:03 -07:00
Sahitya Tummala
0509aa66b2 f2fs: Fix indefinite loop in f2fs_gc()
Policy - foreground GC, LFS mode and greedy GC mode.

Under this policy, f2fs_gc() loops forever to GC as it doesn't have
enough free segements to proceed and thus it keeps calling gc_more
for the same victim segment.  This can happen if the selected victim
segment could not be GC'd due to failed blkaddr validity check i.e.
is_alive() returns false for the blocks set in current validity map.

Fix this by not resetting the sbi->cur_victim_sec to NULL_SEGNO, when
the segment selected could not be GC'd. This helps to select another
segment for GC and thus helps to proceed forward with GC.

[Note]
This can happen due to is_alive as well as atomic_file which skipps
GC.

Signed-off-by: Sahitya Tummala <stummala@codeaurora.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2019-09-23 14:12:03 -07:00
Jaegeuk Kim
eb834a707e f2fs: convert inline_data in prior to i_size_write
In below call path, we change i_size before inline conversion, however,
if we failed to convert inline inode, the inode may have wrong i_size
which is larger than max inline size, result inline inode corruption.

- f2fs_setattr
 - truncate_setsize
 - f2fs_convert_inline_inode

This patch reorders truncate_setsize() and f2fs_convert_inline_inode()
to guarantee inline_data has valid i_size.

Fixes: 0cab80ee0c ("f2fs: fix to convert inline inode in ->setattr")
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2019-09-23 14:12:03 -07:00
Chao Yu
353f7c939f f2fs: fix error path of f2fs_convert_inline_page()
In error path of f2fs_convert_inline_page(), we missed to truncate newly
reserved block in .i_addrs[0] once we failed in get_node_info(), fix it.

Fixes: 7735730d39 ("f2fs: fix to propagate error from __get_meta_page()")
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2019-09-23 14:12:03 -07:00
Chao Yu
647b4e71ce f2fs: add missing documents of reserve_root/resuid/resgid
Add missing documents.

Fixes: 7e65be49ed ("f2fs: add reserved blocks for root user")
Fixes: 7c2e59632b ("f2fs: add resgid and resuid to reserve root blocks")
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2019-09-23 14:12:03 -07:00
Jaegeuk Kim
10d8cca295 f2fs: fix flushing node pages when checkpoint is disabled
This patch fixes skipping node page writes when checkpoint is disabled.
In this period, we can't rely on checkpoint to flush node pages.

Fixes: fd8c8caf7e ("f2fs: let checkpoint flush dnode page of regular")
Fixes: 4354994f09 ("f2fs: checkpoint disabling")
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2019-09-23 14:12:02 -07:00
Chao Yu
5fb0445cc3 f2fs: enhance f2fs_is_checkpoint_ready()'s readability
This patch changes sematics of f2fs_is_checkpoint_ready()'s return
value as: return true when checkpoint is ready, other return false,
it can improve readability of below conditions.

f2fs_submit_page_write()
...
	if (is_sbi_flag_set(sbi, SBI_IS_SHUTDOWN) ||
				!f2fs_is_checkpoint_ready(sbi))
		__submit_merged_bio(io);

f2fs_balance_fs()
...
	if (!f2fs_is_checkpoint_ready(sbi))
		return;

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2019-09-23 14:12:02 -07:00
Chao Yu
31eb8b4b6d f2fs: clean up __bio_alloc()'s parameter
Just cleanup, no logic change.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2019-09-23 14:12:02 -07:00
Chao Yu
724d36415b f2fs: fix wrong error injection path in inc_valid_block_count()
If FAULT_BLOCK type error injection is on, in inc_valid_block_count()
we may decrease sbi->alloc_valid_block_count percpu stat count
incorrectly, fix it.

Fixes: 36b877af79 ("f2fs: Keep alloc_valid_block_count in sync")
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2019-09-23 14:12:02 -07:00
Chao Yu
b23649e363 f2fs: fix to writeout dirty inode during node flush
As Eric reported:

On xfstest generic/204 on f2fs, I'm getting a kernel BUG.

 allocate_segment_by_default+0x9d/0x100 [f2fs]
 f2fs_allocate_data_block+0x3c0/0x5c0 [f2fs]
 do_write_page+0x62/0x110 [f2fs]
 f2fs_do_write_node_page+0x2b/0xa0 [f2fs]
 __write_node_page+0x2ec/0x590 [f2fs]
 f2fs_sync_node_pages+0x756/0x7e0 [f2fs]
 block_operations+0x25b/0x350 [f2fs]
 f2fs_write_checkpoint+0x104/0x1150 [f2fs]
 f2fs_sync_fs+0xa2/0x120 [f2fs]
 f2fs_balance_fs_bg+0x33c/0x390 [f2fs]
 f2fs_write_node_pages+0x4c/0x1f0 [f2fs]
 do_writepages+0x1c/0x70
 __writeback_single_inode+0x45/0x320
 writeback_sb_inodes+0x273/0x5c0
 wb_writeback+0xff/0x2e0
 wb_workfn+0xa1/0x370
 process_one_work+0x138/0x350
 worker_thread+0x4d/0x3d0
 kthread+0x109/0x140

The root cause of this issue is, in a very small partition, e.g.
in generic/204 testcase of fstest suit, filesystem's free space
is 50MB, so at most we can write 12800 inline inode with command:
`echo XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX > $SCRATCH_MNT/$i`,
then filesystem will have:
- 12800 dirty inline data page
- 12800 dirty inode page
- and 12800 dirty imeta (dirty inode)

When we flush node-inode's page cache, we can also flush inline
data with each inode page, however it will run out-of-free-space
in device, then once it triggers checkpoint, there is no room for
huge number of imeta, at this time, GC is useless, as there is no
dirty segment at all.

In order to fix this, we try to recognize inode page during
node_inode's page flushing, and update inode page from dirty inode,
so that later another imeta (dirty inode) flush can be avoided.

Reported-and-tested-by: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2019-09-23 14:12:02 -07:00
Chao Yu
c3d79b7553 f2fs: optimize case-insensitive lookups
This patch ports below casefold enhancement patch from ext4 to f2fs

commit 3ae72562ad ("ext4: optimize case-insensitive lookups")

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2019-09-23 14:12:02 -07:00
Chao Yu
99b78236a3 f2fs: introduce f2fs_match_name() for cleanup
This patch introduces f2fs_match_name() for cleanup.

BTW, it avoids to fallback to normal comparison once it doesn't
match casefolded name.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2019-09-23 14:12:02 -07:00
Sahitya Tummala
bbf253738a f2fs: Fix indefinite loop in f2fs_gc()
Policy - Foreground GC, LFS and greedy GC mode.

Under this policy, f2fs_gc() loops forever to GC as it doesn't have
enough free segements to proceed and thus it keeps calling gc_more
for the same victim segment.  This can happen if the selected victim
segment could not be GC'd due to failed blkaddr validity check i.e.
is_alive() returns false for the blocks set in current validity map.

Fix this by keeping track of such invalid segments and skip those
segments for selection in get_victim_by_default() to avoid endless
GC loop under such error scenarios. Currently, add this logic under
CONFIG_F2FS_CHECK_FS to be able to root cause the issue in debug
version.

Signed-off-by: Sahitya Tummala <stummala@codeaurora.org>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
[Jaegeuk Kim: fix wrong bitmap size]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2019-09-23 14:12:01 -07:00
Chao Yu
f1383d803c f2fs: allocate memory in batch in build_sit_info()
build_sit_info() allocate all bitmaps for each segment one by one,
it's quite low efficiency, this pach changes to allocate large
continuous memory at a time, and divide it and assign for each bitmaps
of segment. For large size image, it can expect improving its mount
speed.

Signed-off-by: Chen Gong <gongchen4@huawei.com>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2019-09-23 14:12:01 -07:00
Chao Yu
2f7cc89b03 f2fs: support FS_IOC_{GET,SET}FSLABEL
Support two generic fs ioctls FS_IOC_{GET,SET}FSLABEL, letting
f2fs pass generic/492 testcase.

Fixes were made by Eric where:
 - f2fs: fix buffer overruns in FS_IOC_{GET, SET}FSLABEL
   utf16s_to_utf8s() and utf8s_to_utf16s() take the number of characters,
   not the number of bytes.

 - f2fs: fix copying too many bytes in FS_IOC_SETFSLABEL
   Userspace provides a null-terminated string, so don't assume that the
   full FSLABEL_MAX bytes can always be copied.

 - f2fs: add missing authorization check in FS_IOC_SETFSLABEL
   FS_IOC_SETFSLABEL modifies the filesystem superblock, so it shouldn't be
   allowed to regular users.  Require CAP_SYS_ADMIN, like xfs and btrfs do.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2019-09-23 14:12:01 -07:00
Chao Yu
3991967c44 f2fs: fix to avoid data corruption by forbidding SSR overwrite
There is one case can cause data corruption.

- write 4k to fileA
- fsync fileA, 4k data is writebacked to lbaA
- write 4k to fileA
- kworker flushs 4k to lbaB; dnode contain lbaB didn't be persisted yet
- write 4k to fileB
- kworker flush 4k to lbaA due to SSR
- SPOR -> dnode with lbaA will be recovered, however lbaA contains fileB's
data

One solution is tracking all fsynced file's block history, and disallow
SSR overwrite on newly invalidated block on that file.

However, during recovery, no matter the dnode is flushed or fsynced, all
previous dnodes until last fsynced one in node chain can be recovered,
that means we need to record all block change in flushed dnode, which
will cause heavy cost, so let's just use simple fix by forbidding SSR
overwrite directly.

Fixes: 5b6c6be2d8 ("f2fs: use SSR for warm node as well")
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2019-09-23 14:12:01 -07:00
YueHaibing
e146fd54db f2fs: Fix build error while CONFIG_NLS=m
If CONFIG_F2FS_FS=y but CONFIG_NLS=m, building fails:

fs/f2fs/file.o: In function `f2fs_ioctl':
file.c:(.text+0xb86f): undefined reference to `utf16s_to_utf8s'
file.c:(.text+0xe651): undefined reference to `utf8s_to_utf16s'

Select CONFIG_NLS to fix this.

Reported-by: Hulk Robot <hulkci@huawei.com>
Fixes: 61a3da4d5ef8 ("f2fs: support FS_IOC_{GET,SET}FSLABEL")
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2019-09-23 14:12:01 -07:00
Chao Yu
9d1e349682 Revert "f2fs: avoid out-of-range memory access"
As Pavel Machek reported:

"We normally use -EUCLEAN to signal filesystem corruption. Plus, it is
good idea to report it to the syslog and mark filesystem as "needing
fsck" if filesystem can do that."

Still we need improve the original patch with:
- use unlikely keyword
- add message print
- return EUCLEAN

However, after rethink this patch, I don't think we should add such
condition check here as below reasons:
- We have already checked the field in f2fs_sanity_check_ckpt(),
- If there is fs corrupt or security vulnerability, there is nothing
to guarantee the field is integrated after the check, unless we do
the check before each of its use, however no filesystem does that.
- We only have similar check for bitmap, which was added due to there
is bitmap corruption happened on f2fs' runtime in product.
- There are so many key fields in SB/CP/NAT did have such check
after f2fs_sanity_check_{sb,cp,..}.

So I propose to revert this unneeded check.

This reverts commit 56f3ce6751.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2019-09-23 14:12:01 -07:00
Lihong Kou
a631f0de4b f2fs: cleanup the code in build_sit_entries.
We do not need to set the SBI_NEED_FSCK flag in the error paths, if we
return error here, we will not update the checkpoint flag, so the code
is useless, just remove it.

Signed-off-by: Lihong Kou <koulihong@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2019-09-23 14:12:00 -07:00
Chao Yu
d6143939e1 f2fs: fix wrong available node count calculation
In mkfs, we have counted quota file's node number in cp.valid_node_count,
so we have to avoid wrong substraction of quota node number in
.available_nid/.avail_node_count calculation.

f2fs_write_check_point_pack()
{
..
	set_cp(valid_node_count, 1 + c.quota_inum + c.lpf_inum);

Fixes: 292c196a36 ("f2fs: reserve nid resource for quota sysfile")
Fixes: 7b63f72f73 ("f2fs: fix to do sanity check on valid node/block count")
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2019-09-23 14:12:00 -07:00
Lihong Kou
9b6fb78ef9 f2fs: remove duplicate code in f2fs_file_write_iter
We will do the same check in generic_write_checks.
if (iocb->ki_flags & IOCB_NOWAIT) && !(iocb->ki_flags & IOCB_DIRECT)
        return -EINVAL;
just remove the same check in f2fs_file_write_iter.

Signed-off-by: Lihong Kou <koulihong@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2019-09-23 14:12:00 -07:00
Chao Yu
39ebaaa82e f2fs: fix to migrate blocks correctly during defragment
During defragment, we missed to trigger fragmented blocks migration
for below condition:

In defragment region:
- total number of valid blocks is smaller than 512;
- the tail part of the region are all holes;

In addtion, return zero to user via range->len if there is no
fragmented blocks.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2019-09-23 14:12:00 -07:00
Chao Yu
0e87a048ef f2fs: use wrapped f2fs_cp_error()
Just cleanup, no logic change.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2019-09-23 14:12:00 -07:00
Chao Yu
216de9d1ec f2fs: fix to use more generic EOPNOTSUPP
EOPNOTSUPP is widely used as error number indicating operation is
not supported in syscall, and ENOTSUPP was defined and only used
for NFSv3 protocol, so use EOPNOTSUPP instead.

Fixes: 0a2aa8fbb9 ("f2fs: refactor __exchange_data_block for speed up")
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2019-09-23 14:12:00 -07:00
Chao Yu
e0185895b2 f2fs: use wrapped IS_SWAPFILE()
Just cleanup, no logic change.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2019-09-23 14:12:00 -07:00
Daniel Rosenberg
b1951281ef f2fs: Support case-insensitive file name lookups
Modeled after commit b886ee3e77 ("ext4: Support case-insensitive file
name lookups")

"""
This patch implements the actual support for case-insensitive file name
lookups in f2fs, based on the feature bit and the encoding stored in the
superblock.

A filesystem that has the casefold feature set is able to configure
directories with the +F (F2FS_CASEFOLD_FL) attribute, enabling lookups
to succeed in that directory in a case-insensitive fashion, i.e: match
a directory entry even if the name used by userspace is not a byte per
byte match with the disk name, but is an equivalent case-insensitive
version of the Unicode string.  This operation is called a
case-insensitive file name lookup.

The feature is configured as an inode attribute applied to directories
and inherited by its children.  This attribute can only be enabled on
empty directories for filesystems that support the encoding feature,
thus preventing collision of file names that only differ by case.

* dcache handling:

For a +F directory, F2Fs only stores the first equivalent name dentry
used in the dcache. This is done to prevent unintentional duplication of
dentries in the dcache, while also allowing the VFS code to quickly find
the right entry in the cache despite which equivalent string was used in
a previous lookup, without having to resort to ->lookup().

d_hash() of casefolded directories is implemented as the hash of the
casefolded string, such that we always have a well-known bucket for all
the equivalencies of the same string. d_compare() uses the
utf8_strncasecmp() infrastructure, which handles the comparison of
equivalent, same case, names as well.

For now, negative lookups are not inserted in the dcache, since they
would need to be invalidated anyway, because we can't trust missing file
dentries.  This is bad for performance but requires some leveraging of
the vfs layer to fix.  We can live without that for now, and so does
everyone else.

* on-disk data:

Despite using a specific version of the name as the internal
representation within the dcache, the name stored and fetched from the
disk is a byte-per-byte match with what the user requested, making this
implementation 'name-preserving'. i.e. no actual information is lost
when writing to storage.

DX is supported by modifying the hashes used in +F directories to make
them case/encoding-aware.  The new disk hashes are calculated as the
hash of the full casefolded string, instead of the string directly.
This allows us to efficiently search for file names in the htree without
requiring the user to provide an exact name.

* Dealing with invalid sequences:

By default, when a invalid UTF-8 sequence is identified, ext4 will treat
it as an opaque byte sequence, ignoring the encoding and reverting to
the old behavior for that unique file.  This means that case-insensitive
file name lookup will not work only for that file.  An optional bit can
be set in the superblock telling the filesystem code and userspace tools
to enforce the encoding.  When that optional bit is set, any attempt to
create a file name using an invalid UTF-8 sequence will fail and return
an error to userspace.

* Normalization algorithm:

The UTF-8 algorithms used to compare strings in f2fs is implemented
in fs/unicode, and is based on a previous version developed by
SGI.  It implements the Canonical decomposition (NFD) algorithm
described by the Unicode specification 12.1, or higher, combined with
the elimination of ignorable code points (NFDi) and full
case-folding (CF) as documented in fs/unicode/utf8_norm.c.

NFD seems to be the best normalization method for F2FS because:

  - It has a lower cost than NFC/NFKC (which requires
    decomposing to NFD as an intermediary step)
  - It doesn't eliminate important semantic meaning like
    compatibility decompositions.

Although:

- This implementation is not completely linguistic accurate, because
different languages have conflicting rules, which would require the
specialization of the filesystem to a given locale, which brings all
sorts of problems for removable media and for users who use more than
one language.
"""

Signed-off-by: Daniel Rosenberg <drosen@google.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2019-09-23 14:11:59 -07:00
Daniel Rosenberg
c69fe4fbf0 f2fs: include charset encoding information in the superblock
Add charset encoding to f2fs to support casefolding. It is modeled after
the same feature introduced in commit c83ad55eaa ("ext4: include charset
encoding information in the superblock")

Currently this is not compatible with encryption, similar to the current
ext4 imlpementation. This will change in the future.

>From the ext4 patch:
"""
The s_encoding field stores a magic number indicating the encoding
format and version used globally by file and directory names in the
filesystem.  The s_encoding_flags defines policies for using the charset
encoding, like how to handle invalid sequences.  The magic number is
mapped to the exact charset table, but the mapping is specific to ext4.
Since we don't have any commitment to support old encodings, the only
encoding I am supporting right now is utf8-12.1.0.

The current implementation prevents the user from enabling encoding and
per-directory encryption on the same filesystem at the same time.  The
incompatibility between these features lies in how we do efficient
directory searches when we cannot be sure the encryption of the user
provided fname will match the actual hash stored in the disk without
decrypting every directory entry, because of normalization cases.  My
quickest solution is to simply block the concurrent use of these
features for now, and enable it later, once we have a better solution.
"""

Signed-off-by: Daniel Rosenberg <drosen@google.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2019-09-23 14:11:59 -07:00
Daniel Rosenberg
e304fb5ba0 fs: Reserve flag for casefolding
In preparation for including the casefold feature within f2fs, elevate
the EXT4_CASEFOLD_FL flag to FS_CASEFOLD_FL.

Signed-off-by: Daniel Rosenberg <drosen@google.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2019-09-23 14:11:59 -07:00
Chao Yu
a8037a8544 f2fs: fix to avoid call kvfree under spinlock
vfree() don't wish to be called from interrupt context, move it
out of spin_lock_irqsave() coverage.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2019-09-23 14:11:59 -07:00
Jia-Ju Bai
1cb57e2eea fs: f2fs: Remove unnecessary checks of SM_I(sbi) in update_general_status()
In fill_super() and put_super(), f2fs_destroy_stats() is called
in prior to f2fs_destroy_segment_manager(), so if current
sbi can still be visited in global stat list, SM_I(sbi) should be
released yet.
For this reason, SM_I(sbi) does not need to be checked in
update_general_status().
Thank Chao Yu for advice.

Signed-off-by: Jia-Ju Bai <baijiaju1990@gmail.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2019-09-23 14:11:59 -07:00
Chao Yu
eb44d8769b f2fs: disallow direct IO in atomic write
Atomic write needs page cache to cache data of transaction,
direct IO should never be allowed in atomic write, detect
and deny it when open atomic write file.

Signed-off-by: Gao Xiang <gaoxiang25@huawei.com>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2019-09-23 14:11:59 -07:00
Chao Yu
c3d777c7b0 f2fs: fix to handle quota_{on,off} correctly
With quota_ino feature on, generic/232 reports an inconsistence issue
on the image.

The root cause is that the testcase tries to:
- use quotactl to shutdown journalled quota based on sysfile;
- and then use quotactl to enable/turn on quota based on specific file
(aquota.user or aquota.group).

Eventually, quota sysfile will be out-of-update due to following specific
file creation.

Change as below to fix this issue:
- deny enabling quota based on specific file if quota sysfile exists.
- set SBI_QUOTA_NEED_REPAIR once sysfile based quota shutdowns via
ioctl.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2019-09-23 14:11:59 -07:00
Chao Yu
fc42f9b12f f2fs: fix to detect cp error in f2fs_setxattr()
It needs to return -EIO if filesystem has been shutdown, fix the
miss case in f2fs_setxattr().

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2019-09-23 14:11:58 -07:00
Chao Yu
99317081e6 f2fs: fix to spread f2fs_is_checkpoint_ready()
We missed to call f2fs_is_checkpoint_ready() in several places, it may
allow space allocation even when free space was exhausted during
checkpoint is disabled, fix to add them.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2019-09-23 14:11:58 -07:00
Chao Yu
93720242e4 f2fs: support fiemap() for directory inode
Adjust f2fs_fiemap() to support fiemap() on directory inode.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2019-09-23 14:11:58 -07:00
Chao Yu
80140c1f21 f2fs: fix to avoid discard command leak
=============================================================================
 BUG discard_cmd (Tainted: G    B      OE  ): Objects remaining in discard_cmd on __kmem_cache_shutdown()
 -----------------------------------------------------------------------------

 INFO: Slab 0xffffe1ac481d22c0 objects=36 used=2 fp=0xffff936b4748bf50 flags=0x2ffff0000000100
 Call Trace:
  dump_stack+0x63/0x87
  slab_err+0xa1/0xb0
  __kmem_cache_shutdown+0x183/0x390
  shutdown_cache+0x14/0x110
  kmem_cache_destroy+0x195/0x1c0
  f2fs_destroy_segment_manager_caches+0x21/0x40 [f2fs]
  exit_f2fs_fs+0x35/0x641 [f2fs]
  SyS_delete_module+0x155/0x230
  ? vtime_user_exit+0x29/0x70
  do_syscall_64+0x6e/0x160
  entry_SYSCALL64_slow_path+0x25/0x25

 INFO: Object 0xffff936b4748b000 @offset=0
 INFO: Object 0xffff936b4748b070 @offset=112
 kmem_cache_destroy discard_cmd: Slab cache still has objects
 Call Trace:
  dump_stack+0x63/0x87
  kmem_cache_destroy+0x1b4/0x1c0
  f2fs_destroy_segment_manager_caches+0x21/0x40 [f2fs]
  exit_f2fs_fs+0x35/0x641 [f2fs]
  SyS_delete_module+0x155/0x230
  do_syscall_64+0x6e/0x160
  entry_SYSCALL64_slow_path+0x25/0x25

Recovery can cache discard commands, so in error path of fill_super(),
we need give a chance to handle them, otherwise it will lead to leak
of discard_cmd slab cache.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2019-09-23 14:11:58 -07:00
Chao Yu
d7be5b042f f2fs: fix to avoid tagging SBI_QUOTA_NEED_REPAIR incorrectly
On a quota disabled image, with fault injection, SBI_QUOTA_NEED_REPAIR
will be set incorrectly in error path of f2fs_evict_inode(), fix it.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2019-09-23 14:11:58 -07:00
Chao Yu
d7895f7914 f2fs: fix to drop meta/node pages during umount
As reported in bugzilla:

https://bugzilla.kernel.org/show_bug.cgi?id=204193

A null pointer dereference bug is triggered in f2fs under kernel-5.1.3.

 kasan_report.cold+0x5/0x32
 f2fs_write_end_io+0x215/0x650
 bio_endio+0x26e/0x320
 blk_update_request+0x209/0x5d0
 blk_mq_end_request+0x2e/0x230
 lo_complete_rq+0x12c/0x190
 blk_done_softirq+0x14a/0x1a0
 __do_softirq+0x119/0x3e5
 irq_exit+0x94/0xe0
 call_function_single_interrupt+0xf/0x20

During umount, we will access NULL sbi->node_inode pointer in
f2fs_write_end_io():

	f2fs_bug_on(sbi, page->mapping == NODE_MAPPING(sbi) &&
				page->index != nid_of_node(page));

The reason is if disable_checkpoint mount option is on, meta dirty
pages can remain during umount, and then be flushed by iput() of
meta_inode, however node_inode has been iput()ed before
meta_inode's iput().

Since checkpoint is disabled, all meta/node datas are useless and
should be dropped in next mount, so in umount, let's adjust
drop_inode() to give a hint to iput_final() to drop all those dirty
datas correctly.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2019-09-23 14:11:58 -07:00
Chao Yu
a0a65a4a0e f2fs: disallow switching io_bits option during remount
If IO alignment feature is turned on after remount, we didn't
initialize mempool of it, it turns out we will encounter panic
during IO submission due to access NULL mempool pointer.

This feature should be set only at mount time, so simply deny
configuring during remount.

This fixes bug reported in bugzilla:

https://bugzilla.kernel.org/show_bug.cgi?id=204135

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2019-09-23 14:11:58 -07:00
Chao Yu
d7ebeff93f f2fs: fix panic of IO alignment feature
Since 07173c3ec2 ("block: enable multipage bvecs"), one bio vector
can store multi pages, so that we can not calculate max IO size of
bio as PAGE_SIZE * bio->bi_max_vecs. However IO alignment feature of
f2fs always has that assumption, so finally, it may cause panic during
IO submission as below stack.

 kernel BUG at fs/f2fs/data.c:317!
 RIP: 0010:__submit_merged_bio+0x8b0/0x8c0
 Call Trace:
  f2fs_submit_page_write+0x3cd/0xdd0
  do_write_page+0x15d/0x360
  f2fs_outplace_write_data+0xd7/0x210
  f2fs_do_write_data_page+0x43b/0xf30
  __write_data_page+0xcf6/0x1140
  f2fs_write_cache_pages+0x3ba/0xb40
  f2fs_write_data_pages+0x3dd/0x8b0
  do_writepages+0xbb/0x1e0
  __writeback_single_inode+0xb6/0x800
  writeback_sb_inodes+0x441/0x910
  wb_writeback+0x261/0x650
  wb_workfn+0x1f9/0x7a0
  process_one_work+0x503/0x970
  worker_thread+0x7d/0x820
  kthread+0x1ad/0x210
  ret_from_fork+0x35/0x40

This patch adds one extra condition to check left space in bio while
trying merging page to bio, to avoid panic.

This bug was reported in bugzilla:

https://bugzilla.kernel.org/show_bug.cgi?id=204043

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2019-09-23 14:11:57 -07:00
Chao Yu
6f1c7fb76e f2fs: introduce {page,io}_is_mergeable() for readability
Wrap merge condition into function for readability, no logic change.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2019-09-23 14:11:57 -07:00
Jaegeuk Kim
125f8ac104 f2fs: fix livelock in swapfile writes
This patch fixes livelock in the below call path when writing swap pages.

[46374.617256] c2    701  __switch_to+0xe4/0x100
[46374.617265] c2    701  __schedule+0x80c/0xbc4
[46374.617273] c2    701  schedule+0x74/0x98
[46374.617281] c2    701  rwsem_down_read_failed+0x190/0x234
[46374.617291] c2    701  down_read+0x58/0x5c
[46374.617300] c2    701  f2fs_map_blocks+0x138/0x9a8
[46374.617310] c2    701  get_data_block_dio_write+0x74/0x104
[46374.617320] c2    701  __blockdev_direct_IO+0x1350/0x3930
[46374.617331] c2    701  f2fs_direct_IO+0x55c/0x8bc
[46374.617341] c2    701  __swap_writepage+0x1d0/0x3e8
[46374.617351] c2    701  swap_writepage+0x44/0x54
[46374.617360] c2    701  shrink_page_list+0x140/0xe80
[46374.617371] c2    701  shrink_inactive_list+0x510/0x918
[46374.617381] c2    701  shrink_node_memcg+0x2d4/0x804
[46374.617391] c2    701  shrink_node+0x10c/0x2f8
[46374.617400] c2    701  do_try_to_free_pages+0x178/0x38c
[46374.617410] c2    701  try_to_free_pages+0x348/0x4b8
[46374.617419] c2    701  __alloc_pages_nodemask+0x7f8/0x1014
[46374.617429] c2    701  pagecache_get_page+0x184/0x2cc
[46374.617438] c2    701  f2fs_new_node_page+0x60/0x41c
[46374.617449] c2    701  f2fs_new_inode_page+0x50/0x7c
[46374.617460] c2    701  f2fs_init_inode_metadata+0x128/0x530
[46374.617472] c2    701  f2fs_add_inline_entry+0x138/0xd64
[46374.617480] c2    701  f2fs_do_add_link+0xf4/0x178
[46374.617488] c2    701  f2fs_create+0x1e4/0x3ac
[46374.617497] c2    701  path_openat+0xdc0/0x1308
[46374.617507] c2    701  do_filp_open+0x78/0x124
[46374.617516] c2    701  do_sys_open+0x134/0x248
[46374.617525] c2    701  SyS_openat+0x14/0x20

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2019-09-23 14:11:57 -07:00
Eric Biggers
b27113a45c f2fs: add fs-verity support
Add fs-verity support to f2fs.  fs-verity is a filesystem feature that
enables transparent integrity protection and authentication of read-only
files.  It uses a dm-verity like mechanism at the file level: a Merkle
tree is used to verify any block in the file in log(filesize) time.  It
is implemented mainly by helper functions in fs/verity/.  See
Documentation/filesystems/fsverity.rst for the full documentation.

The f2fs support for fs-verity consists of:

- Adding a filesystem feature flag and an inode flag for fs-verity.

- Implementing the fsverity_operations to support enabling verity on an
  inode and reading/writing the verity metadata.

- Updating ->readpages() to verify data as it's read from verity files
  and to support reading verity metadata pages.

- Updating ->write_begin(), ->write_end(), and ->writepages() to support
  writing verity metadata pages.

- Calling the fs-verity hooks for ->open(), ->setattr(), and ->ioctl().

Like ext4, f2fs stores the verity metadata (Merkle tree and
fsverity_descriptor) past the end of the file, starting at the first 64K
boundary beyond i_size.  This approach works because (a) verity files
are readonly, and (b) pages fully beyond i_size aren't visible to
userspace but can be read/written internally by f2fs with only some
relatively small changes to f2fs.  Extended attributes cannot be used
because (a) f2fs limits the total size of an inode's xattr entries to
4096 bytes, which wouldn't be enough for even a single Merkle tree
block, and (b) f2fs encryption doesn't encrypt xattrs, yet the verity
metadata *must* be encrypted when the file is because it contains hashes
of the plaintext data.

Acked-by: Jaegeuk Kim <jaegeuk@kernel.org>
Acked-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
2019-09-23 14:11:57 -07:00
Eric Biggers
9771896780 ext4: update on-disk format documentation for fs-verity
Document the format of verity files on ext4, and the corresponding inode
and superblock flags.

Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Eric Biggers <ebiggers@google.com>
2019-09-23 14:11:57 -07:00
Eric Biggers
f954fe23c5 ext4: add fs-verity read support
Make ext4_mpage_readpages() verify data as it is read from fs-verity
files, using the helper functions from fs/verity/.

To support both encryption and verity simultaneously, this required
refactoring the decryption workflow into a generic "post-read
processing" workflow which can do decryption, verification, or both.

The case where the ext4 block size is not equal to the PAGE_SIZE is not
supported yet, since in that case ext4_mpage_readpages() sometimes falls
back to block_read_full_page(), which does not support fs-verity yet.

Co-developed-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Eric Biggers <ebiggers@google.com>
2019-09-23 14:11:57 -07:00
Eric Biggers
5ed7b76f84 ext4: add basic fs-verity support
Add most of fs-verity support to ext4.  fs-verity is a filesystem
feature that enables transparent integrity protection and authentication
of read-only files.  It uses a dm-verity like mechanism at the file
level: a Merkle tree is used to verify any block in the file in
log(filesize) time.  It is implemented mainly by helper functions in
fs/verity/.  See Documentation/filesystems/fsverity.rst for the full
documentation.

This commit adds all of ext4 fs-verity support except for the actual
data verification, including:

- Adding a filesystem feature flag and an inode flag for fs-verity.

- Implementing the fsverity_operations to support enabling verity on an
  inode and reading/writing the verity metadata.

- Updating ->write_begin(), ->write_end(), and ->writepages() to support
  writing verity metadata pages.

- Calling the fs-verity hooks for ->open(), ->setattr(), and ->ioctl().

ext4 stores the verity metadata (Merkle tree and fsverity_descriptor)
past the end of the file, starting at the first 64K boundary beyond
i_size.  This approach works because (a) verity files are readonly, and
(b) pages fully beyond i_size aren't visible to userspace but can be
read/written internally by ext4 with only some relatively small changes
to ext4.  This approach avoids having to depend on the EA_INODE feature
and on rearchitecturing ext4's xattr support to support paging
multi-gigabyte xattrs into memory, and to support encrypting xattrs.
Note that the verity metadata *must* be encrypted when the file is,
since it contains hashes of the plaintext data.

This patch incorporates work by Theodore Ts'o and Chandan Rajendra.

Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Eric Biggers <ebiggers@google.com>
2019-09-23 14:11:56 -07:00
Eric Biggers
0c22f68fc4 fs-verity: support builtin file signatures
To meet some users' needs, add optional support for having fs-verity
handle a portion of the authentication policy in the kernel.  An
".fs-verity" keyring is created to which X.509 certificates can be
added; then a sysctl 'fs.verity.require_signatures' can be set to cause
the kernel to enforce that all fs-verity files contain a signature of
their file measurement by a key in this keyring.

See the "Built-in signature verification" section of
Documentation/filesystems/fsverity.rst for the full documentation.

Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Eric Biggers <ebiggers@google.com>
2019-09-23 14:11:56 -07:00
Eric Biggers
9b8425a7cd fs-verity: add SHA-512 support
Add SHA-512 support to fs-verity.  This is primarily a demonstration of
the trivial changes needed to support a new hash algorithm in fs-verity;
most users will still use SHA-256, due to the smaller space required to
store the hashes.  But some users may prefer SHA-512.

Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Eric Biggers <ebiggers@google.com>
2019-09-23 14:11:56 -07:00
Eric Biggers
ebeb654881 fs-verity: implement FS_IOC_MEASURE_VERITY ioctl
Add a function for filesystems to call to implement the
FS_IOC_MEASURE_VERITY ioctl.  This ioctl retrieves the file measurement
that fs-verity calculated for the given file and is enforcing for reads;
i.e., reads that don't match this hash will fail.  This ioctl can be
used for authentication or logging of file measurements in userspace.

See the "FS_IOC_MEASURE_VERITY" section of
Documentation/filesystems/fsverity.rst for the documentation.

Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Eric Biggers <ebiggers@google.com>
2019-09-23 14:11:56 -07:00