[ Upstream commit 630ef623ed ]
If a tag set is shared across request queues (e.g. SCSI LUNs) then the
block layer core keeps track of the number of active request queues in
tags->active_queues. blk_mq_tag_busy() and blk_mq_tag_idle() update that
atomic counter if the hctx flag BLK_MQ_F_TAG_QUEUE_SHARED is set. Make
sure that blk_mq_exit_queue() calls blk_mq_tag_idle() before that flag is
cleared by blk_mq_del_queue_tag_set().
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Hannes Reinecke <hare@suse.com>
Fixes: 0d2602ca30 ("blk-mq: improve support for shared tags maps")
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20210513171529.7977-1-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 03f26d8f11 ]
In case of shared sbitmap, request won't be held in plug list any more
sine commit 32bc15afed ("blk-mq: Facilitate a shared sbitmap per
tagset"), this way makes request merge from flush plug list & batching
submission not possible, so cause performance regression.
Yanhui reports performance regression when running sequential IO
test(libaio, 16 jobs, 8 depth for each job) in VM, and the VM disk
is emulated with image stored on xfs/megaraid_sas.
Fix the issue by recovering original behavior to allow to hold request
in plug list.
Cc: Yanhui Ma <yama@redhat.com>
Cc: John Garry <john.garry@huawei.com>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: kashyap.desai@broadcom.com
Fixes: 32bc15afed ("blk-mq: Facilitate a shared sbitmap per tagset")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20210514022052.1047665-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 8bfbfb0ddd ]
In f2fs_destroy_compress_ctx(), after f2fs_destroy_compress_ctx(),
cc.cluster_idx will be cleared w/ NULL_CLUSTER, f2fs_cluster_blocks()
may check wrong cluster metadata, fix it.
Fixes: 4c8ff7095b ("f2fs: support data compression")
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit a949dc5f2c ]
pos_fsstress testcase complains a panic as belew:
------------[ cut here ]------------
kernel BUG at fs/f2fs/compress.c:1082!
invalid opcode: 0000 [#1] SMP PTI
CPU: 4 PID: 2753477 Comm: kworker/u16:2 Tainted: G OE 5.12.0-rc1-custom #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
Workqueue: writeback wb_workfn (flush-252:16)
RIP: 0010:prepare_compress_overwrite+0x4c0/0x760 [f2fs]
Call Trace:
f2fs_prepare_compress_overwrite+0x5f/0x80 [f2fs]
f2fs_write_cache_pages+0x468/0x8a0 [f2fs]
f2fs_write_data_pages+0x2a4/0x2f0 [f2fs]
do_writepages+0x38/0xc0
__writeback_single_inode+0x44/0x2a0
writeback_sb_inodes+0x223/0x4d0
__writeback_inodes_wb+0x56/0xf0
wb_writeback+0x1dd/0x290
wb_workfn+0x309/0x500
process_one_work+0x220/0x3c0
worker_thread+0x53/0x420
kthread+0x12f/0x150
ret_from_fork+0x22/0x30
The root cause is truncate() may race with overwrite as below,
so that one reference count left in page can not guarantee the
page attaching in mapping tree all the time, after truncation,
later find_lock_page() may return NULL pointer.
- prepare_compress_overwrite
- f2fs_pagecache_get_page
- unlock_page
- f2fs_setattr
- truncate_setsize
- truncate_inode_page
- delete_from_page_cache
- find_lock_page
Fix this by avoiding referencing updated page.
Fixes: 4c8ff7095b ("f2fs: support data compression")
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 608a969046 ]
When handling rw commands, for inline bio case we only consider
transfer size. This works well when req->sg_cnt fits into the
req->inline_bvec, but it will result in the warning in
__bio_add_page() when req->sg_cnt > NVMET_MAX_INLINE_BVEC.
Consider an I/O size 32768 and first page is not aligned to the page
boundary, then I/O is split in following manner :-
[ 2206.256140] nvmet: sg->length 3440 sg->offset 656
[ 2206.256144] nvmet: sg->length 4096 sg->offset 0
[ 2206.256148] nvmet: sg->length 4096 sg->offset 0
[ 2206.256152] nvmet: sg->length 4096 sg->offset 0
[ 2206.256155] nvmet: sg->length 4096 sg->offset 0
[ 2206.256159] nvmet: sg->length 4096 sg->offset 0
[ 2206.256163] nvmet: sg->length 4096 sg->offset 0
[ 2206.256166] nvmet: sg->length 4096 sg->offset 0
[ 2206.256170] nvmet: sg->length 656 sg->offset 0
Now the req->transfer_size == NVMET_MAX_INLINE_DATA_LEN i.e. 32768, but
the req->sg_cnt is (9) > NVMET_MAX_INLINE_BIOVEC which is (8).
This will result in the following warning message :-
nvmet_bdev_execute_rw()
bio_add_page()
__bio_add_page()
WARN_ON_ONCE(bio_full(bio, len));
This scenario is very hard to reproduce on the nvme-loop transport only
with rw commands issued with the passthru IOCTL interface from the host
application and the data buffer is allocated with the malloc() and not
the posix_memalign().
Fixes: 73383adfad ("nvmet: don't split large I/Os unconditionally")
Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 193fcf371f ]
In this preparation patch, we add helpers to convert lbas to sectors &
sectors to lba. This is needed to eliminate code duplication in the ZBD
backend.
Use these helpers in the block device backend.
Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit efed9a3337 ]
__blk_mq_sched_bio_merge() gets the ctx and hctx for the current CPU and
passes the hctx to ->bio_merge(). kyber_bio_merge() then gets the ctx
for the current CPU again and uses that to get the corresponding Kyber
context in the passed hctx. However, the thread may be preempted between
the two calls to blk_mq_get_ctx(), and the ctx returned the second time
may no longer correspond to the passed hctx. This "works" accidentally
most of the time, but it can cause us to read garbage if the second ctx
came from an hctx with more ctx's than the first one (i.e., if
ctx->index_hw[hctx->type] > hctx->nr_ctx).
This manifested as this UBSAN array index out of bounds error reported
by Jakub:
UBSAN: array-index-out-of-bounds in ../kernel/locking/qspinlock.c:130:9
index 13106 is out of range for type 'long unsigned int [128]'
Call Trace:
dump_stack+0xa4/0xe5
ubsan_epilogue+0x5/0x40
__ubsan_handle_out_of_bounds.cold.13+0x2a/0x34
queued_spin_lock_slowpath+0x476/0x480
do_raw_spin_lock+0x1c2/0x1d0
kyber_bio_merge+0x112/0x180
blk_mq_submit_bio+0x1f5/0x1100
submit_bio_noacct+0x7b0/0x870
submit_bio+0xc2/0x3a0
btrfs_map_bio+0x4f0/0x9d0
btrfs_submit_data_bio+0x24e/0x310
submit_one_bio+0x7f/0xb0
submit_extent_page+0xc4/0x440
__extent_writepage_io+0x2b8/0x5e0
__extent_writepage+0x28d/0x6e0
extent_write_cache_pages+0x4d7/0x7a0
extent_writepages+0xa2/0x110
do_writepages+0x8f/0x180
__writeback_single_inode+0x99/0x7f0
writeback_sb_inodes+0x34e/0x790
__writeback_inodes_wb+0x9e/0x120
wb_writeback+0x4d2/0x660
wb_workfn+0x64d/0xa10
process_one_work+0x53a/0xa80
worker_thread+0x69/0x5b0
kthread+0x20b/0x240
ret_from_fork+0x1f/0x30
Only Kyber uses the hctx, so fix it by passing the request_queue to
->bio_merge() instead. BFQ and mq-deadline just use that, and Kyber can
map the queues itself to avoid the mismatch.
Fixes: a6088845c2 ("block: kyber: make kyber more friendly with merging")
Reported-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Link: https://lore.kernel.org/r/c7598605401a48d5cfeadebb678abd10af22b83f.1620691329.git.osandov@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 7061803522 ]
During commit 067fda1c06 ("iio: hid-sensors: move triggered buffer
setup into hid_sensor_setup_trigger"), the
iio_triggered_buffer_{setup,cleanup}() functions got moved under the
hid-sensor-trigger module.
The above change works fine, if any of the sensors get built. However, when
only the common hid-sensor-trigger module gets built (and none of the
drivers), then the IIO_TRIGGERED_BUFFER symbol isn't selected/enforced.
Previously, each driver would enforce/select the IIO_TRIGGERED_BUFFER
symbol. With this change the HID_SENSOR_IIO_TRIGGER (for the
hid-sensor-trigger module) will enforce that IIO_TRIGGERED_BUFFER gets
selected.
All HID sensor drivers select the HID_SENSOR_IIO_TRIGGER symbol. So, this
change removes the IIO_TRIGGERED_BUFFER enforcement from each driver.
Fixes: 067fda1c06 ("iio: hid-sensors: move triggered buffer setup into hid_sensor_setup_trigger")
Reported-by: Thomas Deutschmann <whissi@gentoo.org>
Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Alexandru Ardelean <aardelean@deviqon.com>
Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Link: https://lore.kernel.org/r/20210414084955.260117-1-aardelean@deviqon.com
Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit bb9c74a5bd ]
As part of commit e81a7018d9 ("usb: dwc3: allocate gadget structure
dynamically") the dwc3_gadget_release() was added which will free
the dwc->gadget structure upon the device's removal when
usb_del_gadget_udc() is called in dwc3_gadget_exit().
However, simply freeing the gadget results a dangling pointer
situation: the endpoints created in dwc3_gadget_init_endpoints()
have their dep->endpoint.ep_list members chained off the list_head
anchored at dwc->gadget->ep_list. Thus when dwc->gadget is freed,
the first dwc3_ep in the list now has a dangling prev pointer and
likewise for the next pointer of the dwc3_ep at the tail of the list.
The dwc3_gadget_free_endpoints() that follows will result in a
use-after-free when it calls list_del().
This was caught by enabling KASAN and performing a driver unbind.
The recent commit 568262bf54 ("usb: dwc3: core: Add shutdown
callback for dwc3") also exposes this as a panic during shutdown.
There are a few possibilities to fix this. One could be to perform
a list_del() of the gadget->ep_list itself which removes it from
the rest of the dwc3_ep chain.
Another approach is what this patch does, by splitting up the
usb_del_gadget_udc() call into its separate "del" and "put"
components. This allows dwc3_gadget_free_endpoints() to be
called before the gadget is finally freed with usb_put_gadget().
Fixes: e81a7018d9 ("usb: dwc3: allocate gadget structure dynamically")
Reviewed-by: Peter Chen <peter.chen@kernel.org>
Signed-off-by: Jack Pham <jackp@codeaurora.org>
Link: https://lore.kernel.org/r/20210501093558.7375-1-jackp@codeaurora.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 237388320d ]
I am seeing missed wakeups which ultimately lead to a deadlock when I am
using virtiofs with DAX enabled and running "make -j". I had to mount
virtiofs as rootfs and also reduce to dax window size to 256M to reproduce
the problem consistently.
So here is the problem. put_unlocked_entry() wakes up waiters only
if entry is not null as well as !dax_is_conflict(entry). But if I
call multiple instances of invalidate_inode_pages2() in parallel,
then I can run into a situation where there are waiters on
this index but nobody will wake these waiters.
invalidate_inode_pages2()
invalidate_inode_pages2_range()
invalidate_exceptional_entry2()
dax_invalidate_mapping_entry_sync()
__dax_invalidate_entry() {
xas_lock_irq(&xas);
entry = get_unlocked_entry(&xas, 0);
...
...
dax_disassociate_entry(entry, mapping, trunc);
xas_store(&xas, NULL);
...
...
put_unlocked_entry(&xas, entry);
xas_unlock_irq(&xas);
}
Say a fault in in progress and it has locked entry at offset say "0x1c".
Now say three instances of invalidate_inode_pages2() are in progress
(A, B, C) and they all try to invalidate entry at offset "0x1c". Given
dax entry is locked, all tree instances A, B, C will wait in wait queue.
When dax fault finishes, say A is woken up. It will store NULL entry
at index "0x1c" and wake up B. When B comes along it will find "entry=0"
at page offset 0x1c and it will call put_unlocked_entry(&xas, 0). And
this means put_unlocked_entry() will not wake up next waiter, given
the current code. And that means C continues to wait and is not woken
up.
This patch fixes the issue by waking up all waiters when a dax entry
has been invalidated. This seems to fix the deadlock I am facing
and I can make forward progress.
Reported-by: Sergio Lopez <slp@redhat.com>
Fixes: ac401cc782 ("dax: New fault locking")
Reviewed-by: Jan Kara <jack@suse.cz>
Suggested-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Link: https://lore.kernel.org/r/20210428190314.1865312-4-vgoyal@redhat.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 3f804f6d20 ]
syzbot reported a possible deadlock in pvclock_gtod_notify():
CPU 0 CPU 1
write_seqcount_begin(&tk_core.seq);
pvclock_gtod_notify() spin_lock(&pool->lock);
queue_work(..., &pvclock_gtod_work) ktime_get()
spin_lock(&pool->lock); do {
seq = read_seqcount_begin(tk_core.seq)
...
} while (read_seqcount_retry(&tk_core.seq, seq);
While this is unlikely to happen, it's possible.
Delegate queue_work() to irq_work() which postpones it until the
tk_core.seq write held region is left and interrupts are reenabled.
Fixes: 16e8d74d2d ("KVM: x86: notifier for clocksource changes")
Reported-by: syzbot+6beae4000559d41d80f8@syzkaller.appspotmail.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Message-Id: <87h7jgm1zy.ffs@nanos.tec.linutronix.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 594b27e677 ]
Nothing prevents the following:
pvclock_gtod_notify()
queue_work(system_long_wq, &pvclock_gtod_work);
...
remove_module(kvm);
...
work_queue_run()
pvclock_gtod_work() <- UAF
Ditto for any other operation on that workqueue list head which touches
pvclock_gtod_work after module removal.
Cancel the work in kvm_arch_exit() to prevent that.
Fixes: 16e8d74d2d ("KVM: x86: notifier for clocksource changes")
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Message-Id: <87czu4onry.ffs@nanos.tec.linutronix.de>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit d981dd1549 ]
Commit ee66e453db (KVM: lapic: Busy wait for timer to expire when
using hv_timer) tries to set ktime->expired_tscdeadline by checking
ktime->hv_timer_in_use since lapic timer oneshot/periodic modes which
are emulated by vmx preemption timer also get advanced, they leverage
the same vmx preemption timer logic with tsc-deadline mode. However,
ktime->hv_timer_in_use is cleared before apic_timer_expired() handling,
let's delay this clearing in preemption-disabled region.
Fixes: ee66e453db ("KVM: lapic: Busy wait for timer to expire when using hv_timer")
Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Message-Id: <1619608082-4187-1-git-send-email-wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit 227545b9a0 upstream.
Screen flickers rapidly when two 4K 60Hz monitors are in use. This issue
doesn't happen when one monitor is 4K 60Hz (pixelclock 594MHz) and
another one is 4K 30Hz (pixelclock 297MHz).
The issue is gone after setting "power_dpm_force_performance_level" to
"high". Following the indication, we found that the issue occurs when
sclk is too low.
So resolve the issue by disabling sclk switching when there are two
monitors requires high pixelclock (> 297MHz).
v2:
- Only apply the fix to Oland.
Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 626e9f41f7 upstream.
When doing a fast fsync on a file, there is a race which can result in the
fsync returning success to user space without logging the inode and without
durably persisting new data.
The following example shows one possible scenario for this:
$ mkfs.btrfs -f /dev/sdc
$ mount /dev/sdc /mnt
$ touch /mnt/bar
$ xfs_io -f -c "pwrite -S 0xab 0 1M" -c "fsync" /mnt/baz
# Now we have:
# file bar == inode 257
# file baz == inode 258
$ mv /mnt/baz /mnt/foo
# Now we have:
# file bar == inode 257
# file foo == inode 258
$ xfs_io -c "pwrite -S 0xcd 0 1M" /mnt/foo
# fsync bar before foo, it is important to trigger the race.
$ xfs_io -c "fsync" /mnt/bar
$ xfs_io -c "fsync" /mnt/foo
# After this:
# inode 257, file bar, is empty
# inode 258, file foo, has 1M filled with 0xcd
<power failure>
# Replay the log:
$ mount /dev/sdc /mnt
# After this point file foo should have 1M filled with 0xcd and not 0xab
The following steps explain how the race happens:
1) Before the first fsync of inode 258, when it has the "baz" name, its
->logged_trans is 0, ->last_sub_trans is 0 and ->last_log_commit is -1.
The inode also has the full sync flag set;
2) After the first fsync, we set inode 258 ->logged_trans to 6, which is
the generation of the current transaction, and set ->last_log_commit
to 0, which is the current value of ->last_sub_trans (done at
btrfs_log_inode()).
The full sync flag is cleared from the inode during the fsync.
The log sub transaction that was committed had an ID of 0 and when we
synced the log, at btrfs_sync_log(), we incremented root->log_transid
from 0 to 1;
3) During the rename:
We update inode 258, through btrfs_update_inode(), and that causes its
->last_sub_trans to be set to 1 (the current log transaction ID), and
->last_log_commit remains with a value of 0.
After updating inode 258, because we have previously logged the inode
in the previous fsync, we log again the inode through the call to
btrfs_log_new_name(). This results in updating the inode's
->last_log_commit from 0 to 1 (the current value of its
->last_sub_trans).
The ->last_sub_trans of inode 257 is updated to 1, which is the ID of
the next log transaction;
4) Then a buffered write against inode 258 is made. This leaves the value
of ->last_sub_trans as 1 (the ID of the current log transaction, stored
at root->log_transid);
5) Then an fsync against inode 257 (or any other inode other than 258),
happens. This results in committing the log transaction with ID 1,
which results in updating root->last_log_commit to 1 and bumping
root->log_transid from 1 to 2;
6) Then an fsync against inode 258 starts. We flush delalloc and wait only
for writeback to complete, since the full sync flag is not set in the
inode's runtime flags - we do not wait for ordered extents to complete.
Then, at btrfs_sync_file(), we call btrfs_inode_in_log() before the
ordered extent completes. The call returns true:
static inline bool btrfs_inode_in_log(...)
{
bool ret = false;
spin_lock(&inode->lock);
if (inode->logged_trans == generation &&
inode->last_sub_trans <= inode->last_log_commit &&
inode->last_sub_trans <= inode->root->last_log_commit)
ret = true;
spin_unlock(&inode->lock);
return ret;
}
generation has a value of 6 (fs_info->generation), ->logged_trans also
has a value of 6 (set when we logged the inode during the first fsync
and when logging it during the rename), ->last_sub_trans has a value
of 1, set during the rename (step 3), ->last_log_commit also has a
value of 1 (set in step 3) and root->last_log_commit has a value of 1,
which was set in step 5 when fsyncing inode 257.
As a consequence we don't log the inode, any new extents and do not
sync the log, resulting in a data loss if a power failure happens
after the fsync and before the current transaction commits.
Also, because we do not log the inode, after a power failure the mtime
and ctime of the inode do not match those we had before.
When the ordered extent completes before we call btrfs_inode_in_log(),
then the call returns false and we log the inode and sync the log,
since at the end of ordered extent completion we update the inode and
set ->last_sub_trans to 2 (the value of root->log_transid) and
->last_log_commit to 1.
This problem is found after removing the check for the emptiness of the
inode's list of modified extents in the recent commit 209ecbb858
("btrfs: remove stale comment and logic from btrfs_inode_in_log()"),
added in the 5.13 merge window. However checking the emptiness of the
list is not really the way to solve this problem, and was never intended
to, because while that solves the problem for COW writes, the problem
persists for NOCOW writes because in that case the list is always empty.
In the case of NOCOW writes, even though we wait for the writeback to
complete before returning from btrfs_sync_file(), we end up not logging
the inode, which has a new mtime/ctime, and because we don't sync the log,
we never issue disk barriers (send REQ_PREFLUSH to the device) since that
only happens when we sync the log (when we write super blocks at
btrfs_sync_log()). So effectively, for a NOCOW case, when we return from
btrfs_sync_file() to user space, we are not guaranteeing that the data is
durably persisted on disk.
Also, while the example above uses a rename exchange to show how the
problem happens, it is not the only way to trigger it. An alternative
could be adding a new hard link to inode 258, since that also results
in calling btrfs_log_new_name() and updating the inode in the log.
An example reproducer using the addition of a hard link instead of a
rename operation:
$ mkfs.btrfs -f /dev/sdc
$ mount /dev/sdc /mnt
$ touch /mnt/bar
$ xfs_io -f -c "pwrite -S 0xab 0 1M" -c "fsync" /mnt/foo
$ ln /mnt/foo /mnt/foo_link
$ xfs_io -c "pwrite -S 0xcd 0 1M" /mnt/foo
$ xfs_io -c "fsync" /mnt/bar
$ xfs_io -c "fsync" /mnt/foo
<power failure>
# Replay the log:
$ mount /dev/sdc /mnt
# After this point file foo often has 1M filled with 0xab and not 0xcd
The reasons leading to the final fsync of file foo, inode 258, not
persisting the new data are the same as for the previous example with
a rename operation.
So fix by never skipping logging and log syncing when there are still any
ordered extents in flight. To avoid making the conditional if statement
that checks if logging an inode is needed harder to read, place all the
logic into an helper function with separate if statements to make it more
manageable and easier to read.
A test case for fstests will follow soon.
For NOCOW writes, the problem existed before commit b5e6c3e170
("btrfs: always wait on ordered extents at fsync time"), introduced in
kernel 4.19, then it went away with that commit since we started to always
wait for ordered extent completion before logging.
The problem came back again once the fast fsync path was changed again to
avoid waiting for ordered extent completion, in commit 487781796d
("btrfs: make fast fsyncs wait only for writeback"), added in kernel 5.10.
However, for COW writes, the race only happens after the recent
commit 209ecbb858 ("btrfs: remove stale comment and logic from
btrfs_inode_in_log()"), introduced in the 5.13 merge window. For NOCOW
writes, the bug existed before that commit. So tag 5.10+ as the release
for stable backports.
CC: stable@vger.kernel.org # 5.10+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 588a513d34 upstream.
To ensure that instructions are observable in a new mapping, the arm64
set_pte_at() implementation cleans the D-cache and invalidates the
I-cache to the PoU. As an optimisation, this is only done on executable
mappings and the PG_dcache_clean page flag is set to avoid future cache
maintenance on the same page.
When two different processes map the same page (e.g. private executable
file or shared mapping) there's a potential race on checking and setting
PG_dcache_clean via set_pte_at() -> __sync_icache_dcache(). While on the
fault paths the page is locked (PG_locked), mprotect() does not take the
page lock. The result is that one process may see the PG_dcache_clean
flag set but the I/D cache maintenance not yet performed.
Avoid test_and_set_bit(PG_dcache_clean) in favour of separate test_bit()
and set_bit(). In the rare event of a race, the cache maintenance is
done twice.
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: <stable@vger.kernel.org>
Cc: Will Deacon <will@kernel.org>
Cc: Steven Price <steven.price@arm.com>
Reviewed-by: Steven Price <steven.price@arm.com>
Acked-by: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20210514095001.13236-1-catalin.marinas@arm.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit e9f4eee9a0 upstream.
When the weight of an active iocg is updated, weight_updated() is called
which in turn calls __propagate_weights() to update the active and inuse
weights so that the effective hierarchical weights are update accordingly.
The current implementation is incorrect for inner active nodes. For an
active leaf iocg, inuse can be any value between 1 and active and the
difference represents how much the iocg is donating. When weight is updated,
as long as inuse is clamped between 1 and the new weight, we're alright and
this is what __propagate_weights() currently implements.
However, that's not how an active inner node's inuse is set. An inner node's
inuse is solely determined by the ratio between the sums of inuse's and
active's of its children - ie. they're results of propagating the leaves'
active and inuse weights upwards. __propagate_weights() incorrectly applies
the same clamping as for a leaf when an active inner node's weight is
updated. Consider a hierarchy which looks like the following with saturating
workloads in AA and BB.
R
/ \
A B
| |
AA BB
1. For both A and B, active=100, inuse=100, hwa=0.5, hwi=0.5.
2. echo 200 > A/io.weight
3. __propagate_weights() update A's active to 200 and leave inuse at 100 as
it's already between 1 and the new active, making A:active=200,
A:inuse=100. As R's active_sum is updated along with A's active,
A:hwa=2/3, B:hwa=1/3. However, because the inuses didn't change, the
hwi's remain unchanged at 0.5.
4. The weight of A is now twice that of B but AA and BB still have the same
hwi of 0.5 and thus are doing the same amount of IOs.
Fix it by making __propgate_weights() always calculate the inuse of an
active inner iocg based on the ratio of child_inuse_sum to child_active_sum.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Dan Schatzberg <dschatzberg@fb.com>
Fixes: 7caa47151a ("blkcg: implement blk-iocost")
Cc: stable@vger.kernel.org # v5.4+
Link: https://lore.kernel.org/r/YJsxnLZV1MnBcqjj@slm.duckdns.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 22247efd82 upstream.
Patch series "mm/hugetlb: Fix issues on file sealing and fork", v2.
Hugh reported issue with F_SEAL_FUTURE_WRITE not applied correctly to
hugetlbfs, which I can easily verify using the memfd_test program, which
seems that the program is hardly run with hugetlbfs pages (as by default
shmem).
Meanwhile I found another probably even more severe issue on that hugetlb
fork won't wr-protect child cow pages, so child can potentially write to
parent private pages. Patch 2 addresses that.
After this series applied, "memfd_test hugetlbfs" should start to pass.
This patch (of 2):
F_SEAL_FUTURE_WRITE is missing for hugetlb starting from the first day.
There is a test program for that and it fails constantly.
$ ./memfd_test hugetlbfs
memfd-hugetlb: CREATE
memfd-hugetlb: BASIC
memfd-hugetlb: SEAL-WRITE
memfd-hugetlb: SEAL-FUTURE-WRITE
mmap() didn't fail as expected
Aborted (core dumped)
I think it's probably because no one is really running the hugetlbfs test.
Fix it by checking FUTURE_WRITE also in hugetlbfs_file_mmap() as what we
do in shmem_mmap(). Generalize a helper for that.
Link: https://lkml.kernel.org/r/20210503234356.9097-1-peterx@redhat.com
Link: https://lkml.kernel.org/r/20210503234356.9097-2-peterx@redhat.com
Fixes: ab3948f58f ("mm/memfd: add an F_SEAL_FUTURE_WRITE seal to memfd")
Signed-off-by: Peter Xu <peterx@redhat.com>
Reported-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 7ed9d238c7 upstream.
Consider the following sequence of events:
1. Userspace issues a UFFD ioctl, which ends up calling into
shmem_mfill_atomic_pte(). We successfully account the blocks, we
shmem_alloc_page(), but then the copy_from_user() fails. We return
-ENOENT. We don't release the page we allocated.
2. Our caller detects this error code, tries the copy_from_user() after
dropping the mmap_lock, and retries, calling back into
shmem_mfill_atomic_pte().
3. Meanwhile, let's say another process filled up the tmpfs being used.
4. So shmem_mfill_atomic_pte() fails to account blocks this time, and
immediately returns - without releasing the page.
This triggers a BUG_ON in our caller, which asserts that the page
should always be consumed, unless -ENOENT is returned.
To fix this, detect if we have such a "dangling" page when accounting
fails, and if so, release it before returning.
Link: https://lkml.kernel.org/r/20210428230858.348400-1-axelrasmussen@google.com
Fixes: cb658a453b ("userfaultfd: shmem: avoid leaking blocks and used blocks in UFFDIO_COPY")
Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
Reported-by: Hugh Dickins <hughd@google.com>
Acked-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit c3187cf322 upstream.
I believe there are some issues introduced by commit 31651c6071
("hfsplus: avoid deadlock on file truncation")
HFS+ has extent records which always contains 8 extents. In case the
first extent record in catalog file gets full, new ones are allocated from
extents overflow file.
In case shrinking truncate happens to middle of an extent record which
locates in extents overflow file, the logic in hfsplus_file_truncate() was
changed so that call to hfs_brec_remove() is not guarded any more.
Right action would be just freeing the extents that exceed the new size
inside extent record by calling hfsplus_free_extents(), and then check if
the whole extent record should be removed. However since the guard
(blk_cnt > start) is now after the call to hfs_brec_remove(), this has
unfortunate effect that the last matching extent record is removed
unconditionally.
To reproduce this issue, create a file which has at least 10 extents, and
then perform shrinking truncate into middle of the last extent record, so
that the number of remaining extents is not under or divisible by 8. This
causes the last extent record (8 extents) to be removed totally instead of
truncating into middle of it. Thus this causes corruption, and lost data.
Fix for this is simply checking if the new truncated end is below the
start of this extent record, making it safe to remove the full extent
record. However call to hfs_brec_remove() can't be moved to it's previous
place since we're dropping ->tree_lock and it can cause a race condition
and the cached info being invalidated possibly corrupting the node data.
Another issue is related to this one. When entering into the block
(blk_cnt > start) we are not holding the ->tree_lock. We break out from
the loop not holding the lock, but hfs_find_exit() does unlock it. Not
sure if it's possible for someone else to take the lock under our feet,
but it can cause hard to debug errors and premature unlocking. Even if
there's no real risk of it, the locking should still always be kept in
balance. Thus taking the lock now just before the check.
Link: https://lkml.kernel.org/r/20210429165139.3082828-1-jouni.roivas@tuxera.com
Fixes: 31651c6071 ("hfsplus: avoid deadlock on file truncation")
Signed-off-by: Jouni Roivas <jouni.roivas@tuxera.com>
Reviewed-by: Anton Altaparmakov <anton@tuxera.com>
Cc: Anatoly Trosinenko <anatoly.trosinenko@gmail.com>
Cc: Viacheslav Dubeyko <slava@dubeyko.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit aec86b052d upstream.
The entry flush mitigation can be enabled/disabled at runtime via a
debugfs file (entry_flush), which causes the kernel to patch itself to
enable/disable the relevant mitigations.
However depending on which mitigation we're using, it may not be safe to
do that patching while other CPUs are active. For example the following
crash:
sleeper[15639]: segfault (11) at c000000000004c20 nip c000000000004c20 lr c000000000004c20
Shows that we returned to userspace with a corrupted LR that points into
the kernel, due to executing the partially patched call to the fallback
entry flush (ie. we missed the LR restore).
Fix it by doing the patching under stop machine. The CPUs that aren't
doing the patching will be spinning in the core of the stop machine
logic. That is currently sufficient for our purposes, because none of
the patching we do is to that code or anywhere in the vicinity.
Fixes: f79643787e ("powerpc/64s: flush L1D on kernel entry")
Cc: stable@vger.kernel.org # v5.10+
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210506044959.1298123-2-mpe@ellerman.id.au
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 8ec7791bae upstream.
The STF (store-to-load forwarding) barrier mitigation can be
enabled/disabled at runtime via a debugfs file (stf_barrier), which
causes the kernel to patch itself to enable/disable the relevant
mitigations.
However depending on which mitigation we're using, it may not be safe to
do that patching while other CPUs are active. For example the following
crash:
User access of kernel address (c00000003fff5af0) - exploit attempt? (uid: 0)
segfault (11) at c00000003fff5af0 nip 7fff8ad12198 lr 7fff8ad121f8 code 1
code: 40820128 e93c00d0 e9290058 7c292840 40810058 38600000 4bfd9a81 e8410018
code: 2c030006 41810154 3860ffb6 e9210098 <e94d8ff0> 7d295279 39400000 40820a3c
Shows that we returned to userspace without restoring the user r13
value, due to executing the partially patched STF exit code.
Fix it by doing the patching under stop machine. The CPUs that aren't
doing the patching will be spinning in the core of the stop machine
logic. That is currently sufficient for our purposes, because none of
the patching we do is to that code or anywhere in the vicinity.
Fixes: a048a07d7f ("powerpc/64s: Add support for a store forwarding barrier at kernel entry/exit")
Cc: stable@vger.kernel.org # v4.17+
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210506044959.1298123-1-mpe@ellerman.id.au
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 1d5e4640e5 upstream.
Commit 4af22ded0e ("arc: fix memory initialization for systems
with two memory banks") fixed highmem, but for the PAE case it causes
bug messages:
| BUG: Bad page state in process swapper pfn:80000
| page:(ptrval) refcount:0 mapcount:1 mapping:00000000 index:0x0 pfn:0x80000 flags: 0x0()
| raw: 00000000 00000100 00000122 00000000 00000000 00000000 00000000 00000000
| raw: 00000000
| page dumped because: nonzero mapcount
| Modules linked in:
| CPU: 0 PID: 0 Comm: swapper Not tainted 5.12.0-rc5-00003-g1e43c377a79f #1
This is because the fix expects highmem to be always less than
lowmem and uses min_low_pfn as an upper zone border for highmem.
max_high_pfn should be ok for both highmem and highmem+PAE cases.
Fixes: 4af22ded0e ("arc: fix memory initialization for systems with two memory banks")
Signed-off-by: Vladimir Isaev <isaev@synopsys.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: stable@vger.kernel.org #5.8 onwards
Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit c5f756d8c6 upstream.
32-bit PAGE_MASK can not be used as a mask for physical addresses
when PAE is enabled. PAGE_MASK_PHYS must be used for physical
addresses instead of PAGE_MASK.
Without this, init gets SIGSEGV if pte_modify was called:
| potentially unexpected fatal signal 11.
| Path: /bin/busybox
| CPU: 0 PID: 1 Comm: init Not tainted 5.12.0-rc5-00003-g1e43c377a79f-dirty
| Insn could not be fetched
| @No matching VMA found
| ECR: 0x00040000 EFA: 0x00000000 ERET: 0x00000000
| STAT: 0x80080082 [IE U ] BTA: 0x00000000
| SP: 0x5f9ffe44 FP: 0x00000000 BLK: 0xaf3d4
| LPS: 0x000d093e LPE: 0x000d0950 LPC: 0x00000000
| r00: 0x00000002 r01: 0x5f9fff14 r02: 0x5f9fff20
| ...
| Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b
Signed-off-by: Vladimir Isaev <isaev@synopsys.com>
Reported-by: kernel test robot <lkp@intel.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: stable@vger.kernel.org
Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 3433adc8bd upstream.
We have NR_syscall syscalls from [0 .. NR_syscall-1].
However the check for invalid syscall number is "> NR_syscall" as
opposed to >=. This off-by-one error erronesously allows "NR_syscall"
to be treated as valid syscall causeing out-of-bounds access into
syscall-call table ensuing a crash (holes within syscall table have a
invalid-entry handler but this is beyond the array implementing the
table).
This problem showed up on v5.6 kernel when testing glibc 2.33 (v5.10
kernel capable, includng faccessat2 syscall 439). The v5.6 kernel has
NR_syscalls=439 (0 to 438). Due to the bug, 439 passed by glibc was
not handled as -ENOSYS but processed leading to a crash.
Link: https://github.com/foss-for-synopsys-dwc-arc-processors/linux/issues/48
Reported-by: Shahab Vahedi <shahab@synopsys.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 61343e6da7 ]
When FEC mode was changed the link didn't know it because
the link was not reset and new parameters were not negotiated.
Set a flag 'I40E_AQ_PHY_ENABLE_ATOMIC_LINK' in 'abilities'
to restart the link and make it run with the new settings.
Fixes: 1d96340196 ("i40e: Add support FEC configuration for Fortville 25G")
Signed-off-by: Jaroslaw Gawin <jaroslawx.gawin@intel.com>
Signed-off-by: Mateusz Palczewski <mateusz.palczewski@intel.com>
Tested-by: Dave Switzer <david.switzer@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>