Pull f2fs update from Jaegeuk Kim:
"In this round, we've mainly focused on performance tuning and critical
bug fixes occurred in low-end devices. Sheng Yong introduced
lost_found feature to keep missing files during recovery instead of
thrashing them. We're preparing coming fsverity implementation. And,
we've got more features to communicate with users for better
performance. In low-end devices, some memory-related issues were
fixed, and subtle race condtions and corner cases were addressed as
well.
Enhancements:
- large nat bitmaps for more free node ids
- add three block allocation policies to pass down write hints given by user
- expose extension list to user and introduce hot file extension
- tune small devices seamlessly for low-end devices
- set readdir_ra by default
- give more resources under gc_urgent mode regarding to discard and cleaning
- introduce fsync_mode to enforce posix or not
- nowait aio support
- add lost_found feature to keep dangling inodes
- reserve bits for future fsverity feature
- add test_dummy_encryption for FBE
Bug fixes:
- don't use highmem for dentry pages
- align memory boundary for bitops
- truncate preallocated blocks in write errors
- guarantee i_times on fsync call
- clear CP_TRIMMED_FLAG correctly
- prevent node chain loop during recovery
- avoid data race between atomic write and background cleaning
- avoid unnecessary selinux violation warnings on resgid option
- GFP_NOFS to avoid deadlock in quota and read paths
- fix f2fs_skip_inode_update to allow i_size recovery
In addition to the above, there are several minor bug fixes and clean-ups"
Cherry-pick from origin/upstream-f2fs-stable-linux-4.9.y:
ac389af190fb f2fs: remain written times to update inode during fsync
270deeb87125 f2fs: make assignment of t->dentry_bitmap more readable
a4fa11c8da10 f2fs: truncate preallocated blocks in error case
4478970f0e73 f2fs: fix a wrong condition in f2fs_skip_inode_update
29cead58f5ea f2fs: reserve bits for fs-verity
848b293a5d95 f2fs: Add a segment type check in inplace write
2dc8f5a3a640 f2fs: no need to initialize zero value for GFP_F2FS_ZERO
83b9bb95a628 f2fs: don't track new nat entry in nat set
a33ce03ac477 f2fs: clean up with F2FS_BLK_ALIGN
a3f8ec8082e3 f2fs: check blkaddr more accuratly before issue a bio
034f11eadb16 f2fs: Set GF_NOFS in read_cache_page_gfp while doing f2fs_quota_read
aa5bcfd8f488 f2fs: introduce a new mount option test_dummy_encryption
9b880fe6e6e2 f2fs: introduce F2FS_FEATURE_LOST_FOUND feature
80d6489a08c1 f2fs: release locks before return in f2fs_ioc_gc_range()
9f1896c490eb f2fs: align memory boundary for bitops
c7930ee88334 f2fs: remove unneeded set_cold_node()
355d2346409a f2fs: add nowait aio support
e9a50e6b9479 f2fs: wrap all options with f2fs_sb_info.mount_opt
b6d2ec83e0c0 f2fs: Don't overwrite all types of node to keep node chain
9a954816298c f2fs: introduce mount option for fsync mode
4ce4eb697068 f2fs: fix to restore old mount option in ->remount_fs
8f711c344e61 f2fs: wrap sb_rdonly with f2fs_readonly
c07478ee84bf f2fs: avoid selinux denial on CAP_SYS_RESOURCE
ac734c416fa9 f2fs: support hot file extension
f4f10221accc f2fs: fix to avoid race in between atomic write and background GC
e87b13ec160b f2fs: do gc in greedy mode for whole range if gc_urgent mode is set
e9878588de94 f2fs: issue discard aggressively in the gc_urgent mode
ad3ce479e6e4 f2fs: set readdir_ra by default
5aae2026bbd2 f2fs: add auto tuning for small devices
78c1fc2d8f27 f2fs: add mount option for segment allocation policy
ecd02f564631 f2fs: don't stop GC if GC is contended
1e72cb27d2d6 f2fs: expose extension_list sysfs entry
061839d178ab f2fs: fix to set KEEP_SIZE bit in f2fs_zero_range
4951ebcbc4e2 f2fs: introduce sb_lock to make encrypt pwsalt update exclusive
939f6be0420f f2fs: remove redundant initialization of pointer 'p'
39bea4bc8ef2 f2fs: flush cp pack except cp pack 2 page at first
770611eb2ab4 f2fs: clean up f2fs_sb_has_xxx functions
4d8e4a8965f9 f2fs: remove redundant check of page type when submit bio
e9878588de94 f2fs: issue discard aggressively in the gc_urgent mode
ad3ce479e6e4 f2fs: set readdir_ra by default
5aae2026bbd2 f2fs: add auto tuning for small devices
78c1fc2d8f27 f2fs: add mount option for segment allocation policy
ecd02f564631 f2fs: don't stop GC if GC is contended
1e72cb27d2d6 f2fs: expose extension_list sysfs entry
061839d178ab f2fs: fix to set KEEP_SIZE bit in f2fs_zero_range
4951ebcbc4e2 f2fs: introduce sb_lock to make encrypt pwsalt update exclusive
939f6be0420f f2fs: remove redundant initialization of pointer 'p'
39bea4bc8ef2 f2fs: flush cp pack except cp pack 2 page at first
770611eb2ab4 f2fs: clean up f2fs_sb_has_xxx functions
4d8e4a8965f9 f2fs: remove redundant check of page type when submit bio
b57a37f01fda f2fs: fix to handle looped node chain during recovery
9ac5b8c54083 f2fs: handle quota for orphan inodes
87c18066016a f2fs: support passing down write hints to block layer with F2FS policy
bcdc571e8d8b f2fs: support passing down write hints given by users to block layer
92413bc12e32 f2fs: fix to clear CP_TRIMMED_FLAG
a1afb55f9784 f2fs: support large nat bitmap
636039140493 f2fs: fix to check extent cache in f2fs_drop_extent_tree
7de4fccdbce1 f2fs: restrict inline_xattr_size configuration
aae506a8b704 f2fs: fix heap mode to reset it back
8fa455bb6ea0 f2fs: fix potential corruption in area before F2FS_SUPER_OFFSET
9d9cb0ef73f9 fscrypt: fix build with pre-4.6 gcc versions
401052ffc6b4 fscrypt: remove 'ci' parameter from fscrypt_put_encryption_info()
549b2061b3b5 fscrypt: fix up fscrypt_fname_encrypted_size() for internal use
c440b5091a0c fscrypt: define fscrypt_fname_alloc_buffer() to be for presented names
7d82f0e1c39a ext4: switch to fscrypt ->symlink() helper functions
ba4efe560438 ext4: switch to fscrypt_get_symlink()
b0edc2f22d24 fscrypt: calculate NUL-padding length in one place only
62cfdd9868c7 fscrypt: move fscrypt_symlink_data to fscrypt_private.h
e4e6776522bc fscrypt: remove fscrypt_fname_usr_to_disk()
45028b5aaa4e f2fs: switch to fscrypt_get_symlink()
f62d3d31e0c7 f2fs: switch to fscrypt ->symlink() helper functions
da32a1633ad3 fscrypt: new helper function - fscrypt_get_symlink()
a7e05c731d11 fscrypt: new helper functions for ->symlink()
eb9c5fd896de fscrypt: trim down fscrypt.h includes
0a02472d8ae2 fscrypt: move fscrypt_is_dot_dotdot() to fs/crypto/fname.c
9d51ca80274c fscrypt: move fscrypt_valid_enc_modes() to fscrypt_private.h
efbfa8c6a056 fscrypt: move fscrypt_operations declaration to fscrypt_supp.h
616dbd2bdc6a fscrypt: split fscrypt_dummy_context_enabled() into supp/notsupp versions
f0c472bcbf1c fscrypt: move fscrypt_ctx declaration to fscrypt_supp.h
bc76f39109b1 fscrypt: move fscrypt_info_cachep declaration to fscrypt_private.h
b67b07ec4964 fscrypt: move fscrypt_control_page() to supp/notsupp headers
d8dfb89961d0 fscrypt: move fscrypt_has_encryption_key() to supp/notsupp headers
Signed-off-by: Jaegeuk Kim <jaegeuk@google.com>
Pull f2fs updates from Jaegeuk Kim:
"In this round, we've followed up to support some generic features such
as cgroup, block reservation, linking fscrypt_ops, delivering
write_hints, and some ioctls. And, we could fix some corner cases in
terms of power-cut recovery and subtle deadlocks.
Enhancements:
- bitmap operations to handle NAT blocks
- readahead to improve readdir speed
- switch to use fscrypt_*
- apply write hints for direct IO
- add reserve_root=%u,resuid=%u,resgid=%u to reserve blocks for root/uid/gid
- modify b_avail and b_free to consider root reserved blocks
- support cgroup writeback
- support FIEMAP_FLAG_XATTR for fibmap
- add F2FS_IOC_PRECACHE_EXTENTS to pre-cache extents
- add F2FS_IOC_{GET/SET}_PIN_FILE to pin LBAs for data blocks
- support inode creation time
Bug fixs:
- sysfile-based quota operations
- memory footprint accounting
- allow to write data on partial preallocation case
- fix deadlock case on fallocate
- fix to handle fill_super errors
- fix missing inode updates of fsync'ed file
- recover renamed file which was fsycn'ed before
- drop inmemory pages in corner error case
- keep last_disk_size correctly
- recover missing i_inline flags during roll-forward
Various clean-up patches were added as well"
Cherry-pick from origin/upstream-f2fs-stable-linux-4.9.y:
71f8f0499e8b f2fs: support inode creation time
58dc6f6fcef7 f2fs: rebuild sit page from sit info in mem
6393cef3f112 f2fs: stop issuing discard if fs is readonly
742bc90e88fa f2fs: clean up duplicated assignment in init_discard_policy
cfabb6edfbc2 f2fs: use GFP_F2FS_ZERO for cleanup
111e8456a697 f2fs: allow to recover node blocks given updated checkpoint
36e041a57ccf f2fs: recover some i_inline flags
3127a7b67ca8 f2fs: correct removexattr behavior for null valued extended attribute
86f78c1e5523 f2fs: drop page cache after fs shutdown
1a3b0047597c f2fs: stop gc/discard thread after fs shutdown
62a91a5a489f f2fs: hanlde error case in f2fs_ioc_shutdown
66356ee5f94c f2fs: split need_inplace_update
5912fbae9da5 f2fs: fix to update last_disk_size correctly
3aa46e2c2187 f2fs: kill F2FS_INLINE_XATTR_ADDRS for cleanup
acdaca27aacb f2fs: clean up error path of fill_super
cf8821115c54 f2fs: avoid hungtask when GC encrypted block if io_bits is set
4be98c9805e4 f2fs: allow quota to use reserved blocks
2a6489c87ee0 f2fs: fix to drop all inmem pages correctly
fd214422395f f2fs: speed up defragment on sparse file
6bce96329c85 f2fs: support F2FS_IOC_PRECACHE_EXTENTS
9ce3d6bb6883 f2fs: add an ioctl to disable GC for specific file
9ef5e6568449 f2fs: prevent newly created inode from being dirtied incorrectly
08ddb1917e04 f2fs: support FIEMAP_FLAG_XATTR
aa9c1c1046e0 f2fs: fix to cover f2fs_inline_data_fiemap with inode_lock
92b8f9c726ef f2fs: check node page again in write end io
4992a3ca15b3 f2fs: fix to caclulate required free section correctly
d1a6b4f6c958 f2fs: handle newly created page when revoking inmem pages
462d762b205a f2fs: add resgid and resuid to reserve root blocks
cbd5e5af8cac f2fs: implement cgroup writeback support
5a5847421d31 f2fs: remove unused pend_list_tag
37d4ca7cd1c1 f2fs: avoid high cpu usage in discard thread
02cfdab8344f f2fs: make local functions static
5fee54098565 f2fs: add reserved blocks for root user
265974636ae0 f2fs: check segment type in __f2fs_replace_block
4f76d6acc6ff f2fs: update inode info to inode page for new file
52b452817452 f2fs: show precise # of blocks that user/root can use
ae0e1fa5a816 f2fs: clean up unneeded declaration
8fc74466298f f2fs: continue to do direct IO if we only preallocate partial blocks
162464df894e f2fs: enable quota at remount from r to w
e270976ff848 f2fs: skip stop_checkpoint for user data writes
d04736926fa7 f2fs: fix missing error number for xattr operation
211cb7bb2428 f2fs: recover directory operations by fsync
2648e735ffe5 f2fs: return error during fill_super
e2a0518d8c24 f2fs: fix an error case of missing update inode page
bf1750bafe86 f2fs: fix potential hangtask in f2fs_trace_pid
c804fcf3df1f f2fs: no need return value in restore summary process
fdd41a8793ad f2fs: use unlikely for release case
a74690b03e24 f2fs: don't return value in truncate_data_blocks_range
987892cc67aa f2fs: clean up f2fs_map_blocks
d7714cb2319a f2fs: clean up hash codes
e3d2a1e946df f2fs: fix error handling in fill_super
b02e72d2942c f2fs: spread f2fs_k{m,z}alloc
ead5259de34d f2fs: inject fault to kvmalloc
e585ca29dd7e f2fs: inject fault to kzalloc
8234ed56e748 f2fs: remove a redundant conditional expression
1a9d6a9c0046 f2fs: apply write hints to select the type of segment for direct write
955e7f58f67b f2fs: switch to fscrypt_prepare_setattr()
268c7f607cb8 f2fs: switch to fscrypt_prepare_lookup()
8dfa646f972c f2fs: switch to fscrypt_prepare_rename()
d5382ccb020a f2fs: switch to fscrypt_prepare_link()
3ccc177c9b8b f2fs: switch to fscrypt_file_open()
8b5674efdc35 f2fs: remove repeated f2fs_bug_on
ba4556cdf10c f2fs: remove an excess variable
46accc925145 f2fs: fix lock dependency in between dio_rwsem & i_mmap_sem
8933908c4f93 f2fs: remove unused parameter
76b6e8ed2058 f2fs: still write data if preallocate only partial blocks
1ed753392fe7 f2fs: introduce sysfs readdir_ra to readahead inode block in readdir
4e68a15eeebc f2fs: fix concurrent problem for updating free bitmap
9be6e7596232 f2fs: remove unneeded memory footprint accounting
923df752db37 f2fs: no need to read nat block if nat_block_bitmap is set
09234be262cb f2fs: reserve nid resource for quota sysfile
Signed-off-by: Jaegeuk Kim <jaegeuk@google.com>
Pull f2fs updates from Jaegeuk Kim:
"In this round, we introduce sysfile-based quota support which is
required for Android by default. In addition, we allow that users are
able to reserve some blocks in runtime to mitigate performance drops
in low free space.
Enhancements:
- assign proper data segments according to write_hints given by user
- issue cache_flush on dirty devices only among multiple devices
- exploit cp_error flag and add more faults to enhance fault
injection test
- conduct more readaheads during f2fs_readdir
- add a range for discard commands
Bug fixes:
- fix zero stat->st_blocks when inline_data is set
- drop crypto key and free stale memory pointer while evict_inode is
failing
- fix some corner cases in free space and segment management
- fix wrong last_disk_size
This series includes lots of clean-ups and code enhancement in terms
of xattr operations, discard/flush command control. In addition, it
adds versatile debugfs entries to monitor f2fs status"
Cherry-picked from origin/upstream-f2fs-stable-linux-4.9.y:
5b2b7f7dd87f f2fs: deny accessing encryption policy if encryption is off
05dac2e89867 f2fs: inject fault in inc_valid_node_count
2e08de4fda00 f2fs: fix to clear FI_NO_PREALLOC
931ecc22b402 f2fs: expose quota information in debugfs
45d6e702d3a9 f2fs: separate nat entry mem alloc from nat_tree_lock
8e2f721703b4 f2fs: validate before set/clear free nat bitmap
27d50282d073 f2fs: avoid opened loop codes in __add_ino_entry
b1823df0e68f f2fs: apply write hints to select the type of segments for buffered write
b561061c067b f2fs: introduce scan_curseg_cache for cleanup
5772e0c102b0 f2fs: optimize the way of traversing free_nid_bitmap
a51e85eae2c3 f2fs: keep scanning until enough free nids are acquired
d75eb8d7345e f2fs: trace checkpoint reason in fsync()
bed6cffdf7e4 f2fs: keep isize once block is reserved cross EOF
5f3fdd2afc9b f2fs: avoid race in between GC and block exchange
51cb399e7ead f2fs: save a multiplication for last_nid calculation
7f41aab3d61d f2fs: fix summary info corruption
148c518517fc f2fs: remove dead code in update_meta_page
c3bc6e5183f0 f2fs: remove unneeded semicolon
9e71a0321f32 f2fs: don't bother with inode->i_version
49f72728e708 f2fs: check curseg space before foreground GC
25d0becffa0a f2fs: use rw_semaphore to protect SIT cache
0108c481d7af f2fs: support quota sys files
d4c292db7b81 f2fs: add quota_ino feature infra
1033eee92c41 f2fs: optimize __update_nat_bits
247e8951164a f2fs: modify for accurate fggc node io stat
c7272f8aebe7 Revert "f2fs: handle dirty segments inside refresh_sit_entry"
068868fc7e26 f2fs: add a function to move nid
b9f73875af11 f2fs: export SSR allocation threshold
ab30204bb9d8 f2fs: give correct trimmed blocks in fstrim
b5db2de4623f f2fs: support bio allocation error injection
58ddec85e417 f2fs: support get_page error injection
ef216e610a14 f2fs: add missing sysfs description
68ab6f8dd541 f2fs: support soft block reservation
d7947e2a3118 f2fs: handle error case when adding xattr entry
50ffaa980f98 f2fs: support flexible inline xattr size
5a8ed073c7fa f2fs: show current cp state
d888fcd74c18 f2fs: add missing quota_initialize
af1cc1ea2309 f2fs: show # of dirty segments via sysfs
6663422a3642 f2fs: stop all the operations by cp_error flag
872d8e3af080 f2fs: remove several redundant assignments
bf823c82e3fe f2fs: avoid using timespec
c70ab1b99321 f2fs: fix to correct no_fggc_candidate
0e6275dc317b Revert "f2fs: return wrong error number on f2fs_quota_write"
41d59230e302 f2fs: remove obsolete pointer for truncate_xattr_node
8c12a10f2ee4 f2fs: retry ENOMEM for quota_read|write
35e13ca2e9d9 f2fs: limit # of inmemory pages
9ca57a7e96e0 f2fs: update ctx->pos correctly when hitting hole in directory
a04208e54b9c f2fs: relocate readahead codes in readdir()
905d0370e6ab f2fs: allow readdir() to be interrupted
2dfbda03f941 f2fs: trace f2fs_readdir
d67586ddf3e9 f2fs: trace f2fs_lookup
4c94f14b3c8b f2fs: skip searching non-exist range in truncate_hole
ac5d4b425739 f2fs: expose some sectors to user in inline data or dentry case
5ded3b82dc2b f2fs: avoid stale fi->gdirty_list pointer
f6b708e25fb5 f2fs/crypto: drop crypto key at evict_inode only
33fdebbb0e7e f2fs: fix to avoid race when accessing last_disk_size
595046758d8e f2fs: Fix bool initialization/comparison
1e5305afa81e f2fs: give up CP_TRIMMED_FLAG if it drops discards
8258fd3054c1 f2fs: trace f2fs_remove_discard
6c46b37d9b43 f2fs: reduce cmd_lock coverage in __issue_discard_cmd
daf437d37cff f2fs: split discard policy
69a596797adf f2fs: wrap discard policy
28e1023e8e8a f2fs: support issuing/waiting discard in range
fd6422ea9264 f2fs: fix to flush multiple device in checkpoint
f014be822ce7 f2fs: enhance multiple device flush
0597a6e4bdcd f2fs: fix to show ino management cache size correctly
cacc1ed0c46a f2fs: drop FI_UPDATE_WRITE tag after f2fs_issue_flush
84af6aeceb49 f2fs: obsolete ALLOC_NID_LIST list
8456d343780d f2fs: convert inline data for direct I/O & FI_NO_PREALLOC
3f01af786c84 f2fs: allow readpages with NULL file pointer
2f0df25e6529 f2fs: show flush list status in sysfs
20ef20fbf78e f2fs: introduce read_xattr_block
126221de375b f2fs: introduce read_inline_xattr
127faa71f6a6 Revert "f2fs: reuse nids more aggressively"
c19928e660fb Revert "f2fs: node segment is prior to data segment selected victim"
Change-Id: I2f892e6ee75c41e84241f37b1903e0c32387d95b
Signed-off-by: Jaegeuk Kim <jaegeuk@google.com>
Cherry-picked from upstream-f2fs-stable-linux-4.9.y
Changes include:
commit 30da3a4de96733 ("f2fs: hurry up to issue discard after io interruption")
commit d1c363b48398d4 ("f2fs: fix to show correct discard_granularity in sysfs")
...
commit e6b120d4d01ab0 ("f2fs/fscrypt: catch up to v4.12")
commit 4d7931d72758db ("KEYS: Differentiate uses of rcu_dereference_key() and user_key_payload()")
Signed-off-by: Hyojun Kim <hyojun@google.com>
This patch introduces spinlock to protect updating process of ckpt_flags
field in struct f2fs_checkpoint, it avoids incorrectly updating in race
condition.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
[Jaegeuk Kim: add __is_set_ckpt_flags likewise __set_ckpt_flags]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Previously, we used cp_version only to detect recoverable dnodes.
In order to avoid same garbage cp_version, we needed to truncate the next
dnode during checkpoint, resulting in additional discard or data write.
If we can distinguish this by using crc in addition to cp_version, we can
remove this overhead.
There is backward compatibility concern where it changes node_footer layout.
So, this patch introduces a new checkpoint flag, CP_CRC_RECOVERY_FLAG, to
detect new layout. New layout will be activated only when this flag is set.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The readahead nat pages are more likely to be reclaimed quickly, so it'd better
to gather more free nids in advance.
And, let's keep some free nids as much as possible.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In write_begin, if storage supports stable_page, we don't need to wait for
writeback to update its contents.
This patch introduces to use wait_for_stable_page instead of
wait_on_page_writeback.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The sceanrio is:
1. create fully node blocks
2. flush node blocks
3. write inline_data for all the node blocks again
4. flush node blocks redundantly
So, this patch tries to flush inline_data when flushing node blocks.
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch exports a new sysfs entry 'dirty_nat_ratio' to control threshold
of dirty nat entries, if current ratio exceeds configured threshold,
checkpoint will be triggered in f2fs_balance_fs_bg for flushing dirty nats.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
When testing f2fs with xfstest, generic/251 is stuck for long time,
the case uses below serials to obtain fresh released space in device,
in order to prepare for following fstrim test.
1. rm -rf /mnt/dir
2. mkdir /mnt/dir/
3. cp -axT `pwd`/ /mnt/dir/
4. goto 1
During preparing step, all nat entries will be cached in nat cache,
most of them are dirty entries with invalid blkaddr, which means
nodes related to these entries have been truncated, and they could
be reused after the dirty entries been checkpointed.
However, there was no checkpoint been triggered, so nid allocators
(e.g. mkdir, creat) will run into long journey of iterating all NAT
pages, looking for free nids in alloc_nid->build_free_nids.
Here, in f2fs_balance_fs_bg we give another chance to do checkpoint
to flush nat entries for reusing them in free nid cache when dirty
entry count exceeds 10% of max count.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Use sbi->blocks_per_seg directly to avoid unnecessary calculation when using
1 << sbi->log_blocks_per_seg.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
After finishing building free nid cache, we will try to readahead
asynchronously 4 more pages for the next reloading, the count of
readahead nid pages is fixed.
In some case, like SMR drive, read less sectors with fixed count
each time we trigger RA may be low efficient, since we will face
high seeking overhead, so we'd better let user to configure this
parameter from sysfs in specific workload.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The periodic checkpoint can resolve the previous issue.
So, now we can use this again to improve the reported performance regression:
https://lkml.org/lkml/2015/10/8/20
This reverts commit 15bec0ff5a9ba6d203178fa8772259df6207942a.
Previously, we skip dentry block writes when wbc is SYNC_NONE with no memory
pressure and the number of dirty pages is pretty small.
But, we didn't skip for normal data writes, which gives us not much big impact
on overall performance.
Moreover, by skipping some data writes, kworker falls into infinite loop to try
to write blocks, when many dir inodes have only one dentry block.
So, this patch removes skipping data writes.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Introduce infra macro and data structure for rb-tree based extent cache:
Macros:
* EXT_TREE_VEC_SIZE: indicate vector size for gang lookup in extent tree.
* F2FS_MIN_EXTENT_LEN: indicate minimum length of extent managed in cache.
* EXTENT_CACHE_SHRINK_NUMBER: indicate number of extent in cache will be shrunk.
Basic data structures for extent cache:
* struct extent_tree: extent tree entry per inode.
* struct extent_node: extent info node linked in extent tree.
Besides, adding new extent cache related fields in f2fs_sb_info.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Changman Lee <cm224.lee@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In the normal case, the radix_tree_nodes are freed successfully.
But, when cp_error was detected, we should destroy them forcefully.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In do_recover_data, we find and update previous node pages after updating
its new block addresses.
After then, we call fill_node_footer without reset field, we erase its
cold bit so that this new cold node block is written to wrong log area.
This patch fixes not to miss its old flag.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch moves one member of struct nat_entry: _flag_ to struct node_info,
so _version_ in struct node_info and _flag_ which are unsigned char type will
merge to one 32-bit space in register/memory. So the size of nat_entry will be
reduced from 28 bytes to 24 bytes (for 64-bit machine, reduce its size from 40
bytes to 32 bytes) and then slab memory using by f2fs will be reduced.
changes from v2:
o update description of memory usage gain for 64-bit machine suggested by
Changman Lee.
changes from v1:
o introduce inline copy_node_info() to copy valid data from node info suggested
by Jaegeuk Kim, it can avoid bug.
Reviewed-by: Changman Lee <cm224.lee@samsung.com>
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch adds two new ioctls to release inmemory pages grabbed by atomic
writes.
o f2fs_ioc_abort_volatile_write
- If transaction was failed, all the grabbed pages and data should be written.
o f2fs_ioc_release_volatile_write
- This is to enhance the performance of PERSIST mode in sqlite.
In order to avoid huge memory consumption which causes OOM, this patch changes
volatile writes to use normal dirty pages, instead blocked flushing to the disk
as long as system does not suffer from memory pressure.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch adds to control the memory footprint used by ino entries.
This will conduct best effort, not strictly.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Introduce f2fs_change_bit to simplify the change bit logic in
function set_to_next_nat{sit}.
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Previously, f2fs tries to reorganize the dirty nat entries into multiple sets
according to its nid ranges. This can improve the flushing nat pages, however,
if there are a lot of cached nat entries, it becomes a bottleneck.
This patch introduces a new set management flow by removing dirty nat list and
adding a series of set operations when the nat entry becomes dirty.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch revisited whole the recovery information during the f2fs_sync_file.
In this patch, there are three information to make a decision.
a) IS_CHECKPOINTED, /* is it checkpointed before? */
b) HAS_FSYNCED_INODE, /* is the inode fsynced before? */
c) HAS_LAST_FSYNC, /* has the latest node fsync mark? */
And, the scenarios for our rule are based on:
[Term] F: fsync_mark, D: dentry_mark
1. inode(x) | CP | inode(x) | dnode(F)
2. inode(x) | CP | inode(F) | dnode(F)
3. inode(x) | CP | dnode(F) | inode(x) | inode(F)
4. inode(x) | CP | dnode(F) | inode(F)
5. CP | inode(x) | dnode(F) | inode(DF)
6. CP | inode(DF) | dnode(F)
7. CP | dnode(F) | inode(DF)
8. CP | dnode(F) | inode(x) | inode(DF)
For example, #3, the three conditions should be changed as follows.
inode(x) | CP | dnode(F) | inode(x) | inode(F)
a) x o o o o
b) x x x x o
c) x o o x o
If f2fs_sync_file stops ------^,
it should write inode(F) --------------^
So, the need_inode_block_update should return true, since
c) get_nat_flag(e, HAS_LAST_FSYNC), is false.
For example, #8,
CP | alloc | dnode(F) | inode(x) | inode(DF)
a) o x x x x
b) x x x o
c) o o x o
If f2fs_sync_file stops -------^,
it should write inode(DF) --------------^
Note that, the roll-forward policy should follow this rule, which means,
if there are any missing blocks, we doesn't need to recover that inode.
Signed-off-by: Huang Ying <ying.huang@intel.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch introduces a flag in the nat entry structure to merge various
information such as checkpointed and fsync_done marks.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The nm_i->fcnt checking is executed before spin_lock, so if another
thread delete the last free_nid from the list, the wrong nid may be
gotten. So fix the race condition by moving the nm_i->fnct checking
into spin_lock.
Signed-off-by: Huang, Ying <ying.huang@intel.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Although building NAT journal in cursum reduce the read/write work for NAT
block, but previous design leave us lower performance when write checkpoint
frequently for these cases:
1. if journal in cursum has already full, it's a bit of waste that we flush all
nat entries to page for persistence, but not to cache any entries.
2. if journal in cursum is not full, we fill nat entries to journal util
journal is full, then flush the left dirty entries to disk without merge
journaled entries, so these journaled entries may be flushed to disk at next
checkpoint but lost chance to flushed last time.
In this patch we merge dirty entries located in same NAT block to nat entry set,
and linked all set to list, sorted ascending order by entries' count of set.
Later we flush entries in sparse set into journal as many as we can, and then
flush merged entries to disk. In this way we can not only gain in performance,
but also save lifetime of flash device.
In my testing environment, it shows this patch can help to reduce NAT block
writes obviously. In hard disk test case: cost time of fsstress is stablely
reduced by about 5%.
1. virtual machine + hard disk:
fsstress -p 20 -n 200 -l 5
node num cp count nodes/cp
based 4599.6 1803.0 2.551
patched 2714.6 1829.6 1.483
2. virtual machine + 32g micro SD card:
fsstress -p 20 -n 200 -l 1 -w -f chown=0 -f creat=4 -f dwrite=0
-f fdatasync=4 -f fsync=4 -f link=0 -f mkdir=4 -f mknod=4 -f rename=5
-f rmdir=5 -f symlink=0 -f truncate=4 -f unlink=5 -f write=0 -S
node num cp count nodes/cp
based 84.5 43.7 1.933
patched 49.2 40.0 1.23
Our latency of merging op shows not bad when handling extreme case like:
merging a great number of dirty nats:
latency(ns) dirty nat count
3089219 24922
5129423 27422
4000250 24523
change log from v1:
o fix wrong logic in add_nat_entry when grab a new nat entry set.
o swith to create slab cache in create_node_manager_caches.
o use GFP_ATOMIC instead of GFP_NOFS to avoid potential long latency.
change log from v2:
o make comment position more appropriate suggested by Jaegeuk Kim.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
fix the following checkpatch warning:
WARNING: do {} while (0) macros should not be semicolon terminated
Signed-off-by: Zhang Zhen <zhenzhang.zhang@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
This patch splits grab_cache_page_write_begin into grab_cache_page and
wait_on_page_writeback for node pages.
This patch intends to enhance the latency to get node pages by alleviating
unnecessary wait_on_page_writeback.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
If so many dirty dentry blocks are cached, not reached to the flush condition,
we should fall into livelock in balance_dirty_pages.
So, let's consider the mem size for the condition.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
This patch introduce raw_nat_from_node_info() to simplfy some codes, and also
use exist function node_info_from_raw_nat() to do the same job.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
If multiple redundant fsync calls are triggered, we don't need to write its
node pages with fsync mark continuously.
So, this patch adds FI_NEED_FSYNC to track whether the latest node block is
written with the fsync mark or not.
If the mark was set, a new fsync doesn't need to write a node block.
Otherwise, we should do a new node block with the mark for roll-forward
recovery.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
The NM_WOUT_THRESHOLD is now obsolete since f2fs starts to control on a basis
of the memory footprint.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
This patch introduces ram_thresh, a sysfs entry, which controls the memory
footprint used by the free nid list and the nat cache.
Previously, the free nid list was controlled by MAX_FREE_NIDS, while the nat
cache was managed by NM_WOUT_THRESHOLD.
However, this approach cannot be applied dynamically according to the system.
So, this patch adds ram_thresh that users can specify the threshold, which is
in order of 1 / 1024.
For example, if the total ram size is 4GB and the value is set to 10 by default,
f2fs tries to control the number of free nids and nat caches not to consume over
10 * (4GB / 1024) = 10MB.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
This patch introduces a help function f2fs_has_xattr_block for better
readability.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
The nat cache entry maintains a status whether it is checkpointed or not.
So, if a new cache entry is loaded from the last checkpoint,
nat_entry->checkpointed should be true.
If the cache entry is modified as being dirty, nat_entry->checkpoint should
be false.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Update several comments:
1. use f2fs_{un}lock_op install of mutex_{un}lock_op.
2. update comment of get_data_block().
3. update description of node offset.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
This patch fixes the use of XATTR_NODE_OFFSET.
o The offset should not use several MSB bits which are used by marking node
blocks.
o IS_DNODE should handle XATTR_NODE_OFFSET to avoid potential abnormality
during the fsync call.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Introduce help function F2FS_NODE() to simplify the conversion of node_page to
f2fs_node.
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
If a file is linked, f2fs loose its parent inode number so that fsync calls
for the linked file should do checkpoint all the time.
But, if we can recover its parent inode number after the checkpoint, we can
adjust roll-forward mechanism for the further fsync calls, which is able to
improve the fsync performance significatly.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
There are various functions with common code which could be separated
out to make common routines. So, made new routines and in order to
retain the same call path and no major changes, written some macros
to access those routines.
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Amit Sahrawat <a.sahrawat@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
If (ofs % (NIDS_PER_BLOCK + 1) == 0), the node is an indirect node block.
Signed-off-by: Zhihui Zhang <zzhsuny@gmail.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
In get_node_page, we do not need to call lock_page all the time.
If the node page is cached as uptodate,
1. grab_cache_page locks the page,
2. read_node_page unlocks the page, and
3. lock_page is called for further process.
Let's avoid this.
Reviewed-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
When we recover fsync'ed data after power-off-recovery, we should guarantee
that any parent inode number should be correct for each direct inode blocks.
So, let's make the following rules.
- The fsync should do checkpoint to all the inodes that were experienced hard
links.
- So, the only normal files can be recovered by roll-forward.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>