Signed-off-by: Amit Pundir <amit.pundir@linaro.org>
Conflicts:
fs/f2fs/extent_cache.c
Pick changes from AOSP Change-Id: Icd8a85ac0c19a8aa25cd2591a12b4e9b85bdf1c5
("f2fs: catch up to v4.14-rc1")
fs/f2fs/namei.c
Pick changes from AOSP F2FS backport commit 7d5c08fd91
("f2fs: backport from (4c1fad64 - Merge tag 'for-f2fs-4.9' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs)")
commit 06f29cc81f upstream.
In the function __ext4_grp_locked_error(), __save_error_info()
is called to save error info in super block block, but does not sync
that information to disk to info the subsequence fsck after reboot.
This patch writes the error information to disk. After this patch,
I think there is no obvious EXT4 error handle branches which leads to
"Remounting filesystem read-only" will leave the disk partition miss
the subsequence fsck.
Signed-off-by: Zhouyi Zhou <zhouzhouyi@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 21fc61c73c upstream.
kmap() in page_follow_link_light() needed to go - allowing to hold
an arbitrary number of kmaps for long is a great way to deadlocking
the system.
new helper (inode_nohighmem(inode)) needs to be used for pagecache
symlinks inodes; done for all in-tree cases. page_follow_link_light()
instrumented to yell about anything missed.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jin Qian <jinqian@google.com>
Signed-off-by: Jin Qian <jinqian@android.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cherry-picked from origin/upstream-f2fs-stable-linux-4.4.y:
ba1ade7101 fscrypt: resolve some cherry-pick bugs
9e32f17d24 fscrypt: move to generic async completion
4ecacbed6e crypto: introduce crypto wait for async op
42d89da82b fscrypt: lock mutex before checking for bounce page pool
2286508d17 fscrypt: new helper function - fscrypt_prepare_setattr()
5cbdd42ad2 fscrypt: new helper function - fscrypt_prepare_lookup()
a31feba5c1 fscrypt: new helper function - fscrypt_prepare_rename()
95efafb623 fscrypt: new helper function - fscrypt_prepare_link()
2b4b4f98dd fscrypt: new helper function - fscrypt_file_open()
8c815f381c fscrypt: new helper function - fscrypt_require_key()
272e435025 fscrypt: remove unneeded empty fscrypt_operations structs
1034eeec51 fscrypt: remove ->is_encrypted()
32c0d3ae9d fscrypt: switch from ->is_encrypted() to IS_ENCRYPTED()
a4781dd1f1 fs, fscrypt: add an S_ENCRYPTED inode flag
ff0a3dbc93 fscrypt: clean up include file mess
bc4a61c60b fscrypt: fix dereference of NULL user_key_payload
a53dc7e005 fscrypt: make ->dummy_context() return bool
Change-Id: I461d742adc7b77177df91429a1fd9c8624a698d6
Signed-off-by: Jaegeuk Kim <jaegeuk@google.com>
commit 9d5afec6b8 upstream.
On a ppc64 machine, when mounting a fuzzed ext2 image (generated by
fsfuzzer) the following call trace is seen,
VFS: brelse: Trying to free free buffer
WARNING: CPU: 1 PID: 6913 at /root/repos/linux/fs/buffer.c:1165 .__brelse.part.6+0x24/0x40
.__brelse.part.6+0x20/0x40 (unreliable)
.ext4_find_entry+0x384/0x4f0
.ext4_lookup+0x84/0x250
.lookup_slow+0xdc/0x230
.walk_component+0x268/0x400
.path_lookupat+0xec/0x2d0
.filename_lookup+0x9c/0x1d0
.vfs_statx+0x98/0x140
.SyS_newfstatat+0x48/0x80
system_call+0x58/0x6c
This happens because the directory that ext4_find_entry() looks up has
inode->i_size that is less than the block size of the filesystem. This
causes 'nblocks' to have a value of zero. ext4_bread_batch() ends up not
reading any of the directory file's blocks. This renders the entries in
bh_use[] array to continue to have garbage data. buffer_uptodate() on
bh_use[0] can then return a zero value upon which brelse() function is
invoked.
This commit fixes the bug by returning -ENOENT when the directory file
has no associated blocks.
Reported-by: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit c894aa9757 upstream.
Currently, fallocate(2) with KEEP_SIZE followed by a fdatasync(2)
then crash, we'll see wrong allocated block number (stat -c %b), the
blocks allocated beyond EOF are all lost. fstests generic/468
exposes this bug.
Commit 67a7d5f561 ("ext4: fix fdatasync(2) after extent
manipulation operations") fixed all the other extent manipulation
operation paths such as hole punch, zero range, collapse range etc.,
but forgot the fallocate case.
So similarly, fix it by recording the correct journal tid in ext4
inode in fallocate(2) path, so that ext4_sync_file() will wait for
the right tid to be committed on fdatasync(2).
This addresses the test failure in xfstests test generic/468.
Signed-off-by: Eryu Guan <eguan@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit a0b3bc8553 upstream.
fscrypt_initialize(), which allocates the global bounce page pool when
an encrypted file is first accessed, uses "double-checked locking" to
try to avoid locking fscrypt_init_mutex. However, it doesn't use any
memory barriers, so it's theoretically possible for a thread to observe
a bounce page pool which has not been fully initialized. This is a
classic bug with "double-checked locking".
While "only a theoretical issue" in the latest kernel, in pre-4.8
kernels the pointer that was checked was not even the last to be
initialized, so it was easily possible for a crash (NULL pointer
dereference) to happen. This was changed only incidentally by the large
refactor to use fs/crypto/.
Solve both problems in a trivial way that can easily be backported: just
always take the mutex. It's theoretically less efficient, but it
shouldn't be noticeable in practice as the mutex is only acquired very
briefly once per encrypted file.
Later I'd like to make this use a helper macro like DO_ONCE(). However,
DO_ONCE() runs in atomic context, so we'd need to add a new macro that
allows blocking.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 51e3ae81ec upstream.
If there are pending writes subject to delayed allocation, then i_size
will show size after the writes have completed, while i_disksize
contains the value of i_size on the disk (since the writes have not
been persisted to disk).
If fallocate(2) is called with the FALLOC_FL_KEEP_SIZE flag, either
with or without the FALLOC_FL_ZERO_RANGE flag set, and the new size
after the fallocate(2) is between i_size and i_disksize, then after a
crash, if a journal commit has resulted in the changes made by the
fallocate() call to be persisted after a crash, but the delayed
allocation write has not resolved itself, i_size would not be updated,
and this would cause the following e2fsck complaint:
Inode 12, end of extent exceeds allowed value
(logical block 33, physical block 33441, len 7)
This can only take place on a sparse file, where the fallocate(2) call
is allocating blocks in a range which is before a pending delayed
allocation write which is extending i_size. Since this situation is
quite rare, and the window in which the crash must take place is
typically < 30 seconds, in practice this condition will rarely happen.
Nevertheless, it can be triggered in testing, and in particular by
xfstests generic/456.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reported-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 06bd3c36a7 upstream.
Huang has reported that in his powerfail testing he is seeing stale
block contents in some of recently allocated blocks although he mounts
ext4 in data=ordered mode. After some investigation I have found out
that indeed when delayed allocation is used, we don't add inode to
transaction's list of inodes needing flushing before commit. Originally
we were doing that but commit f3b59291a6 removed the logic with a
flawed argument that it is not needed.
The problem is that although for delayed allocated blocks we write their
contents immediately after allocating them, there is no guarantee that
the IO scheduler or device doesn't reorder things and thus transaction
allocating blocks and attaching them to inode can reach stable storage
before actual block contents. Actually whenever we attach freshly
allocated blocks to inode using a written extent, we should add inode to
transaction's ordered inode list to make sure we properly wait for block
contents to be written before committing the transaction. So that is
what we do in this patch. This also handles other cases where stale data
exposure was possible - like filling hole via mmap in
data=ordered,nodelalloc mode.
The only exception to the above rule are extending direct IO writes where
blkdev_direct_IO() waits for IO to complete before increasing i_size and
thus stale data exposure is not possible. For now we don't complicate
the code with optimizing this special case since the overhead is pretty
low. In case this is observed to be a performance problem we can always
handle it using a special flag to ext4_map_blocks().
Fixes: f3b59291a6
Reported-by: "HUANG Weller (CM/ESW12-CN)" <Weller.Huang@cn.bosch.com>
Tested-by: "HUANG Weller (CM/ESW12-CN)" <Weller.Huang@cn.bosch.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
[bwh: Backported to 4.4:
- Drop check for EXT4_GET_BLOCKS_ZERO flag
- Adjust context]
Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Conflicts due to AOSP's backported commits:
fs/f2fs/crypto.c
fs/f2fs/crypto_fname.c
Deleted by AOSP commit c1286ff41c ("f2fs: backport from (4c1fad64 -
Merge tag 'for-f2fs-4.9' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs)")
fs/f2fs/crypto_key.c
fs/f2fs/data.c
fs/f2fs/file.c
AOSP commit 13f002354d ("f2fs: catch up to v4.14-rc1")
override most of stable 4.4.y changes.
Signed-off-by: Amit Pundir <amit.pundir@linaro.org>
[ Upstream commit 5469d7c308 ]
Avoid using stripe_width for sbi->s_stripe value if it is not actually
set. It prevents using the stride for sbi->s_stripe.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit d9b22cf9f5 ]
When a filesystem is created using:
mkfs.ext4 -b 4096 -E stride=512 <dev>
and we try to allocate 64MB extent, we will end up directly in
ext4_mb_complex_scan_group(). This is because the request is detected
as power-of-two allocation (so we start in ext4_mb_regular_allocator()
with ac_criteria == 0) however the check before
ext4_mb_simple_scan_group() refuses the direct buddy scan because the
allocation request is too large. Since cr == 0, the check whether we
should use ext4_mb_scan_aligned() fails as well and we fall back to
ext4_mb_complex_scan_group().
Fix the problem by checking for upper limit on power-of-two requests
directly when detecting them.
Reported-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit d60b5b7854 upstream.
When an fscrypt-encrypted file is opened, we request the file's master
key from the keyrings service as a logon key, then access its payload.
However, a revoked key has a NULL payload, and we failed to check for
this. request_key() *does* skip revoked keys, but there is still a
window where the key can be revoked before we acquire its semaphore.
Fix it by checking for a NULL payload, treating it like a key which was
already revoked at the time it was requested.
Fixes: 88bd6ccdcd ("ext4 crypto: add encryption key management facilities")
Reviewed-by: James Morris <james.l.morris@oracle.com>
Cc: <stable@vger.kernel.org> [v4.1+]
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 1bd8d6cd3e upstream.
In the ext4 implementations of SEEK_HOLE and SEEK_DATA, make sure we
return -ENXIO for negative offsets instead of banging around inside
the extent code and returning -EFSCORRUPTED.
Reported-by: Mateusz S <muttdini@gmail.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 173b8439e1 upstream.
While we allow deletes without the key, the following should not be
permitted:
# cd /vdc/encrypted-dir-without-key
# ls -l
total 4
-rw-r--r-- 1 root root 0 Dec 27 22:35 6,LKNRJsp209FbXoSvJWzB
-rw-r--r-- 1 root root 286 Dec 27 22:35 uRJ5vJh9gE7vcomYMqTAyD
# mv uRJ5vJh9gE7vcomYMqTAyD 6,LKNRJsp209FbXoSvJWzB
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit a3bb2d5587 upstream.
When new directory 'DIR1' is created in a directory 'DIR0' with SGID bit
set, DIR1 is expected to have SGID bit set (and owning group equal to
the owning group of 'DIR0'). However when 'DIR0' also has some default
ACLs that 'DIR1' inherits, setting these ACLs will result in SGID bit on
'DIR1' to get cleared if user is not member of the owning group.
Fix the problem by moving posix_acl_update_mode() out of
__ext4_set_acl() into ext4_set_acl(). That way the function will not be
called when inheriting ACLs which is what we want as it prevents SGID
bit clearing and the mode has been properly set by posix_acl_create()
anyway.
Fixes: 073931017b
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit a056bdaae7 upstream.
mpage_submit_page() can race with another process growing i_size and
writing data via mmap to the written-back page. As mpage_submit_page()
samples i_size too early, it may happen that ext4_bio_write_page()
zeroes out too large tail of the page and thus corrupts user data.
Fix the problem by sampling i_size only after the page has been
write-protected in page tables by clear_page_dirty_for_io() call.
Reported-by: Michael Zimmer <michael@swarm64.com>
CC: stable@vger.kernel.org
Fixes: cb20d51883
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 95f1fda47c upstream.
Quota does not get enabled for read-only mounts if filesystem
has quota feature, so that quotas cannot updated during orphan
cleanup, which will lead to quota inconsistency.
This patch turn on quotas during orphan cleanup for this case,
make sure quotas can be updated correctly.
Reported-by: Jan Kara <jack@suse.cz>
Signed-off-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit b0a5a9589d upstream.
Current ext4 quota should always "usage enabled" if the
quota feautre is enabled. But in ext4_orphan_cleanup(), it
turn quotas off directly (used for the older journaled
quota), so we cannot turn it on again via "quotaon" unless
umount and remount ext4.
Simple reproduce:
mkfs.ext4 -O project,quota /dev/vdb1
mount -o prjquota /dev/vdb1 /mnt
chattr -p 123 /mnt
chattr +P /mnt
touch /mnt/aa /mnt/bb
exec 100<>/mnt/aa
rm -f /mnt/aa
sync
echo c > /proc/sysrq-trigger
#reboot and mount
mount -o prjquota /dev/vdb1 /mnt
#query status
quotaon -Ppv /dev/vdb1
#output
quotaon: Cannot find mountpoint for device /dev/vdb1
quotaon: No correct mountpoint specified.
This patch add check for journaled quotas to avoid incorrect
quotaoff when ext4 has quota feautre.
Signed-off-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit aec51758ce upstream.
On a 32-bit platform, the value of n_blcoks_count may be wrong during
the file system is resized to size larger than 2^32 blocks. This may
caused the superblock being corrupted with zero blocks count.
Fixes: 1c6bd7173d
Signed-off-by: Jerry Lee <jerrylee@qnap.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit fcf5ea1099 upstream.
ext4_find_unwritten_pgoff() does not properly handle a situation when
starting index is in the middle of a page and blocksize < pagesize. The
following command shows the bug on filesystem with 1k blocksize:
xfs_io -f -c "falloc 0 4k" \
-c "pwrite 1k 1k" \
-c "pwrite 3k 1k" \
-c "seek -a -r 0" foo
In this example, neither lseek(fd, 1024, SEEK_HOLE) nor lseek(fd, 2048,
SEEK_DATA) will return the correct result.
Fix the problem by neglecting buffers in a page before starting offset.
Reported-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 1ea1516fbb upstream.
kstrtoull returns 0 on success, however, in reserved_clusters_store we
will return -EINVAL if kstrtoull returns 0, it makes us fail to update
reserved_clusters value through sysfs.
Fixes: 76d33bca55
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Miao Xie <miaoxie@huawei.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Huang has reported that in his powerfail testing he is seeing stale
block contents in some of recently allocated blocks although he mounts
ext4 in data=ordered mode. After some investigation I have found out
that indeed when delayed allocation is used, we don't add inode to
transaction's list of inodes needing flushing before commit. Originally
we were doing that but commit f3b59291a6 removed the logic with a
flawed argument that it is not needed.
The problem is that although for delayed allocated blocks we write their
contents immediately after allocating them, there is no guarantee that
the IO scheduler or device doesn't reorder things and thus transaction
allocating blocks and attaching them to inode can reach stable storage
before actual block contents. Actually whenever we attach freshly
allocated blocks to inode using a written extent, we should add inode to
transaction's ordered inode list to make sure we properly wait for block
contents to be written before committing the transaction. So that is
what we do in this patch. This also handles other cases where stale data
exposure was possible - like filling hole via mmap in
data=ordered,nodelalloc mode.
The only exception to the above rule are extending direct IO writes where
blkdev_direct_IO() waits for IO to complete before increasing i_size and
thus stale data exposure is not possible. For now we don't complicate
the code with optimizing this special case since the overhead is pretty
low. In case this is observed to be a performance problem we can always
handle it using a special flag to ext4_map_blocks().
CC: stable@vger.kernel.org
Fixes: f3b59291a6
Reported-by: "HUANG Weller (CM/ESW12-CN)" <Weller.Huang@cn.bosch.com>
Tested-by: "HUANG Weller (CM/ESW12-CN)" <Weller.Huang@cn.bosch.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
(cherry picked from commit 06bd3c36a7)
Signed-off-by: Connor O'Brien <connoro@google.com>
Bug: 62198330
Change-Id: Idc78b64e4f23e6085301c60057af6029b49a8193
This reverts commit e2968fb8e7.
For various reasons, we've had to start enforcing upstream that ext4
encryption can only be used if the filesystem superblock has the
EXT4_FEATURE_INCOMPAT_ENCRYPT flag set, as was the intended design.
Unfortunately, Android isn't ready for this quite yet, since its
userspace still needs to be updated to set the flag at mkfs time, or
else fix it later with tune2fs. It will need some more time to be fixed
properly, so for now to avoid breaking some devices, revert the kernel
change.
Bug: 36231741
Signed-off-by: Eric Biggers <ebiggers@google.com>
Change-Id: I30bd54afb68dbaf9801f8954099dffa90a2f8df1
commit 67a7d5f561 upstream.
Currently, extent manipulation operations such as hole punch, range
zeroing, or extent shifting do not record the fact that file data has
changed and thus fdatasync(2) has a work to do. As a result if we crash
e.g. after a punch hole and fdatasync, user can still possibly see the
punched out data after journal replay. Test generic/392 fails due to
these problems.
Fix the problem by properly marking that file data has changed in these
operations.
Fixes: a4bb6b64e3
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 7d95eddf31 upstream.
Currently, SEEK_HOLE implementation in ext4 may both return that there's
a hole at some offset although that offset already has data and skip
some holes during a search for the next hole. The first problem is
demostrated by:
xfs_io -c "falloc 0 256k" -c "pwrite 0 56k" -c "seek -h 0" file
wrote 57344/57344 bytes at offset 0
56 KiB, 14 ops; 0.0000 sec (2.054 GiB/sec and 538461.5385 ops/sec)
Whence Result
HOLE 0
Where we can see that SEEK_HOLE wrongly returned offset 0 as containing
a hole although we have written data there. The second problem can be
demonstrated by:
xfs_io -c "falloc 0 256k" -c "pwrite 0 56k" -c "pwrite 128k 8k"
-c "seek -h 0" file
wrote 57344/57344 bytes at offset 0
56 KiB, 14 ops; 0.0000 sec (1.978 GiB/sec and 518518.5185 ops/sec)
wrote 8192/8192 bytes at offset 131072
8 KiB, 2 ops; 0.0000 sec (2 GiB/sec and 500000.0000 ops/sec)
Whence Result
HOLE 139264
Where we can see that hole at offsets 56k..128k has been ignored by the
SEEK_HOLE call.
The underlying problem is in the ext4_find_unwritten_pgoff() which is
just buggy. In some cases it fails to update returned offset when it
finds a hole (when no pages are found or when the first found page has
higher index than expected), in some cases conditions for detecting hole
are just missing (we fail to detect a situation where indices of
returned pages are not contiguous).
Fix ext4_find_unwritten_pgoff() to properly detect non-contiguous page
indices and also handle all cases where we got less pages then expected
in one place and handle it properly there.
Fixes: c8c0df241c
CC: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 6b06cdee81 upstream.
When accessing an encrypted directory without the key, userspace must
operate on filenames derived from the ciphertext names, which contain
arbitrary bytes. Since we must support filenames as long as NAME_MAX,
we can't always just base64-encode the ciphertext, since that may make
it too long. Currently, this is solved by presenting long names in an
abbreviated form containing any needed filesystem-specific hashes (e.g.
to identify a directory block), then the last 16 bytes of ciphertext.
This needs to be sufficient to identify the actual name on lookup.
However, there is a bug. It seems to have been assumed that due to the
use of a CBC (ciphertext block chaining)-based encryption mode, the last
16 bytes (i.e. the AES block size) of ciphertext would depend on the
full plaintext, preventing collisions. However, we actually use CBC
with ciphertext stealing (CTS), which handles the last two blocks
specially, causing them to appear "flipped". Thus, it's actually the
second-to-last block which depends on the full plaintext.
This caused long filenames that differ only near the end of their
plaintexts to, when observed without the key, point to the wrong inode
and be undeletable. For example, with ext4:
# echo pass | e4crypt add_key -p 16 edir/
# seq -f "edir/abcdefghijklmnopqrstuvwxyz012345%.0f" 100000 | xargs touch
# find edir/ -type f | xargs stat -c %i | sort | uniq | wc -l
100000
# sync
# echo 3 > /proc/sys/vm/drop_caches
# keyctl new_session
# find edir/ -type f | xargs stat -c %i | sort | uniq | wc -l
2004
# rm -rf edir/
rm: cannot remove 'edir/_A7nNFi3rhkEQlJ6P,hdzluhODKOeWx5V': Structure needs cleaning
...
To fix this, when presenting long encrypted filenames, encode the
second-to-last block of ciphertext rather than the last 16 bytes.
Although it would be nice to solve this without depending on a specific
encryption mode, that would mean doing a cryptographic hash like SHA-256
which would be much less efficient. This way is sufficient for now, and
it's still compatible with encryption modes like HEH which are strong
pseudorandom permutations. Also, changing the presented names is still
allowed at any time because they are only provided to allow applications
to do things like delete encrypted directories. They're not designed to
be used to persistently identify files --- which would be hard to do
anyway, given that they're encrypted after all.
For ease of backports, this patch only makes the minimal fix to both
ext4 and f2fs. It leaves ubifs as-is, since ubifs doesn't compare the
ciphertext block yet. Follow-on patches will clean things up properly
and make the filesystems use a shared helper function.
Fixes: 5de0b4d0cd ("ext4 crypto: simplify and speed up filename encryption")
Reported-by: Gwendal Grignou <gwendal@chromium.org>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 272f98f684 upstream.
To mitigate some types of offline attacks, filesystem encryption is
designed to enforce that all files in an encrypted directory tree use
the same encryption policy (i.e. the same encryption context excluding
the nonce). However, the fscrypt_has_permitted_context() function which
enforces this relies on comparing struct fscrypt_info's, which are only
available when we have the encryption keys. This can cause two
incorrect behaviors:
1. If we have the parent directory's key but not the child's key, or
vice versa, then fscrypt_has_permitted_context() returned false,
causing applications to see EPERM or ENOKEY. This is incorrect if
the encryption contexts are in fact consistent. Although we'd
normally have either both keys or neither key in that case since the
master_key_descriptors would be the same, this is not guaranteed
because keys can be added or removed from keyrings at any time.
2. If we have neither the parent's key nor the child's key, then
fscrypt_has_permitted_context() returned true, causing applications
to see no error (or else an error for some other reason). This is
incorrect if the encryption contexts are in fact inconsistent, since
in that case we should deny access.
To fix this, retrieve and compare the fscrypt_contexts if we are unable
to set up both fscrypt_infos.
While this slightly hurts performance when accessing an encrypted
directory tree without the key, this isn't a case we really need to be
optimizing for; access *with* the key is much more important.
Furthermore, the performance hit is barely noticeable given that we are
already retrieving the fscrypt_context and doing two keyring searches in
fscrypt_get_encryption_info(). If we ever actually wanted to optimize
this case we might start by caching the fscrypt_contexts.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit c9af28fdd4 upstream.
We don't want the writeback triggered from the journal commit (in
data=writeback mode) to cause the journal to abort due to
generic_writepages() returning an ENOMEM error. In addition, if
fsync() fails with ENOMEM, most applications will probably not do the
right thing.
So if we are doing a data integrity sync, and ext4_encrypt() returns
ENOMEM, we will submit any queued I/O to date, and then retry the
allocation using GFP_NOFAIL.
Google-Bug-Id: 27641567
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 7b4cc9787f upstream.
Currently the case of writing via mmap to a file with inline data is not
handled. This is maybe a rare case since it requires a writable memory
map of a very small file, but it is trivial to trigger with on
inline_data filesystem, and it causes the
'BUG_ON(ext4_test_inode_state(inode, EXT4_STATE_MAY_INLINE_DATA));' in
ext4_writepages() to be hit:
mkfs.ext4 -O inline_data /dev/vdb
mount /dev/vdb /mnt
xfs_io -f /mnt/file \
-c 'pwrite 0 1' \
-c 'mmap -w 0 1m' \
-c 'mwrite 0 1' \
-c 'fsync'
kernel BUG at fs/ext4/inode.c:2723!
invalid opcode: 0000 [#1] SMP
CPU: 1 PID: 2532 Comm: xfs_io Not tainted 4.11.0-rc1-xfstests-00301-g071d9acf3d1f #633
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-20170228_101828-anatol 04/01/2014
task: ffff88003d3a8040 task.stack: ffffc90000300000
RIP: 0010:ext4_writepages+0xc89/0xf8a
RSP: 0018:ffffc90000303ca0 EFLAGS: 00010283
RAX: 0000028410000000 RBX: ffff8800383fa3b0 RCX: ffffffff812afcdc
RDX: 00000a9d00000246 RSI: ffffffff81e660e0 RDI: 0000000000000246
RBP: ffffc90000303dc0 R08: 0000000000000002 R09: 869618e8f99b4fa5
R10: 00000000852287a2 R11: 00000000a03b49f4 R12: ffff88003808e698
R13: 0000000000000000 R14: 7fffffffffffffff R15: 7fffffffffffffff
FS: 00007fd3e53094c0(0000) GS:ffff88003e400000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fd3e4c51000 CR3: 000000003d554000 CR4: 00000000003406e0
Call Trace:
? _raw_spin_unlock+0x27/0x2a
? kvm_clock_read+0x1e/0x20
do_writepages+0x23/0x2c
? do_writepages+0x23/0x2c
__filemap_fdatawrite_range+0x80/0x87
filemap_write_and_wait_range+0x67/0x8c
ext4_sync_file+0x20e/0x472
vfs_fsync_range+0x8e/0x9f
? syscall_trace_enter+0x25b/0x2d0
vfs_fsync+0x1c/0x1e
do_fsync+0x31/0x4a
SyS_fsync+0x10/0x14
do_syscall_64+0x69/0x131
entry_SYSCALL64_slow_path+0x25/0x25
We could try to be smart and keep the inline data in this case, or at
least support delayed allocation when allocating the block, but these
solutions would be more complicated and don't seem worthwhile given how
rare this case seems to be. So just fix the bug by calling
ext4_convert_inline_data() when we're asked to make a page writable, so
that any inline data gets evicted, with the block allocated immediately.
Reported-by: Nick Alcock <nick.alcock@oracle.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 3d43bcfef5 upstream.
This avoids potential problems caused by a race where the inode gets
renamed out from its parent directory and the parent directory is
deleted while ext4_d_revalidate() is running.
Fixes: 28b4c26396
Reported-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 28b4c26396 upstream.
Add a validation check for dentries for encrypted directory to make
sure we're not caching stale data after a key has been added or removed.
Also check to make sure that status of the encryption key is updated
when readdir(2) is executed.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 9a200d075e upstream.
...otherwise an user can enable encryption for certain files even
when the filesystem is unable to support it.
Such a case would be a filesystem created by mkfs.ext4's default
settings, 1KiB block size. Ext4 supports encyption only when block size
is equal to PAGE_SIZE.
But this constraint is only checked when the encryption feature flag
is set.
Signed-off-by: Richard Weinberger <richard@nod.at>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 9e92f48c34 upstream.
We aren't checking to see if the in-inode extended attribute is
corrupted before we try to expand the inode's extra isize fields.
This can lead to potential crashes caused by the BUG_ON() check in
ext4_xattr_shift_entries().
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: Julia Lawall <julia.lawall@lip6.fr>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(Cherry-pick from commit 82939d7999)
The conversion is generally straightforward. The only tricky part is
that xattr block corresponding to found mbcache entry can get freed
before we get buffer lock for that block. So we have to check whether
the entry is still valid after getting buffer lock.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Bug; 32461228