v5.11 changes the blkdev lookup mechanism completely since commit
22ae8ce8b8 ("block: simplify bdev/disk lookup in blkdev_get"),
and small part of the change is to unhash part bdev inode when
deleting partition. Turns out this kind of change does fix one
nasty issue in case of BLOCK_EXT_MAJOR:
1) when one partition is deleted & closed, disk_put_part() is always
called before bdput(bdev), see blkdev_put(); so the part's devt can
be freed & re-used before the inode is dropped
2) then new partition with same devt can be created just before the
inode in 1) is dropped, then the old inode/bdev structurein 1) is
re-used for this new partition, this way causes use-after-free and
kernel panic.
It isn't possible to backport the whole big patchset of "merge struct
block_device and struct hd_struct v4" for addressing this issue.
https://lore.kernel.org/linux-block/20201128161510.347752-1-hch@lst.de/
So fixes it by unhashing part bdev in delete_partition(), and this way
is actually aligned with v5.11+'s behavior.
Backported from the following 5.10.y commit:
5f2f775605 ("block: unhash blkdev part inode when the part is deleted")
Reported-by: Shiwei Cui <cuishw@inspur.com>
Tested-by: Shiwei Cui <cuishw@inspur.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 884f0e84f1 ]
The pending timer has been set up in blk_throtl_init(). However, the
timer is not deleted in blk_throtl_exit(). This means that the timer
handler may still be running after freeing the timer, which would
result in a use-after-free.
Fix by calling del_timer_sync() to delete the timer in blk_throtl_exit().
Signed-off-by: Li Jinlin <lijinlin3@huawei.com>
Link: https://lore.kernel.org/r/20210907121242.2885564-1-lijinlin3@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit 97f433c360 upstream.
We get I/O errors when we run md-raid1 on the top of dm-integrity on the
top of ramdisk.
device-mapper: integrity: Bio not aligned on 8 sectors: 0xff00, 0xff
device-mapper: integrity: Bio not aligned on 8 sectors: 0xff00, 0xff
device-mapper: integrity: Bio not aligned on 8 sectors: 0xffff, 0x1
device-mapper: integrity: Bio not aligned on 8 sectors: 0xffff, 0x1
device-mapper: integrity: Bio not aligned on 8 sectors: 0x8048, 0xff
device-mapper: integrity: Bio not aligned on 8 sectors: 0x8147, 0xff
device-mapper: integrity: Bio not aligned on 8 sectors: 0x8246, 0xff
device-mapper: integrity: Bio not aligned on 8 sectors: 0x8345, 0xbb
The ramdisk device has logical_block_size 512 and max_sectors 255. The
dm-integrity device uses logical_block_size 4096 and it doesn't affect the
"max_sectors" value - thus, it inherits 255 from the ramdisk. So, we have
a device with max_sectors not aligned on logical_block_size.
The md-raid device sees that the underlying leg has max_sectors 255 and it
will split the bios on 255-sector boundary, making the bios unaligned on
logical_block_size.
In order to fix the bug, we round down max_sectors to logical_block_size.
Cc: stable@vger.kernel.org
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
This fixes the
4.9 backport commit f530afb974
which was
upstream commit f5bbbbe4d6.
The upstream commit added a call to synchronize_rcu to
_blk_mq_update_nr_hw_queues, just after freezing queues.
In the backport this landed (in blk_mq_update_nr_hw_queues instead),
just after unfreezeing queues.
This commit moves the call to its intended place.
Fixes: f530afb974 ("blk-mq: sync the update nr_hw_queues with blk_mq_queue_tag_busy_iter")
Signed-off-by: Giuliano Procida <gprocida@google.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit 5acb3cc2c2 upstream.
The lockdep code had reported the following unsafe locking scenario:
CPU0 CPU1
---- ----
lock(s_active#228);
lock(&bdev->bd_mutex/1);
lock(s_active#228);
lock(&bdev->bd_mutex);
*** DEADLOCK ***
The deadlock may happen when one task (CPU1) is trying to delete a
partition in a block device and another task (CPU0) is accessing
tracing sysfs file (e.g. /sys/block/dm-1/trace/act_mask) in that
partition.
The s_active isn't an actual lock. It is a reference count (kn->count)
on the sysfs (kernfs) file. Removal of a sysfs file, however, require
a wait until all the references are gone. The reference count is
treated like a rwsem using lockdep instrumentation code.
The fact that a thread is in the sysfs callback method or in the
ioctl call means there is a reference to the opended sysfs or device
file. That should prevent the underlying block structure from being
removed.
Instead of using bd_mutex in the block_device structure, a new
blk_trace_mutex is now added to the request_queue structure to protect
access to the blk_trace structure.
Suggested-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Waiman Long <longman@redhat.com>
Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Fix typo in patch subject line, and prune a comment detailing how
the code used to work.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit 530ca2c9bd upstream.
A recent commit runs tag iterator callbacks under the rcu read lock,
but existing callbacks do not satisfy the non-blocking requirement.
The commit intended to prevent an iterator from accessing a queue that's
being modified. This patch fixes the original issue by taking a queue
reference instead of reading it, which allows callbacks to make blocking
calls.
Fixes: f5bbbbe4d6 ("blk-mq: sync the update nr_hw_queues with blk_mq_queue_tag_busy_iter")
Acked-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Giuliano Procida <gprocida@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit f5bbbbe4d6 upstream.
For blk-mq, part_in_flight/rw will invoke blk_mq_in_flight/rw to
account the inflight requests. It will access the queue_hw_ctx and
nr_hw_queues w/o any protection. When updating nr_hw_queues and
blk_mq_in_flight/rw occur concurrently, panic comes up.
Before update nr_hw_queues, the q will be frozen. So we could use
q_usage_counter to avoid the race. percpu_ref_is_zero is used here
so that we will not miss any in-flight request. The access to
nr_hw_queues and queue_hw_ctx in blk_mq_queue_tag_busy_iter are
under rcu critical section, __blk_mq_update_nr_hw_queues could use
synchronize_rcu to ensure the zeroed q_usage_counter to be globally
visible.
Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Cc: Giuliano Procida <gprocida@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 1a67356e9a ]
It is wrong to use bio->bi_vcnt to figure out how many segments
there are in the bio even though CLONED flag isn't set on this bio,
because this bio may be splitted or advanced.
So always use bio_segments() in blk_recount_segments(), and it shouldn't
cause any performance loss now because the physical segment number is figured
out in blk_queue_split() and BIO_SEG_VALID is set meantime since
bdced438ac ("block: setup bi_phys_segments after splitting").
Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Fixes: 76d8137a31 ("blk-merge: recaculate segment if it isn't less than max segments")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit ad6bf88a6c upstream.
Logical block size has type unsigned short. That means that it can be at
most 32768. However, there are architectures that can run with 64k pages
(for example arm64) and on these architectures, it may be possible to
create block devices with 64k block size.
For exmaple (run this on an architecture with 64k pages):
Mount will fail with this error because it tries to read the superblock using 2-sector
access:
device-mapper: writecache: I/O is not aligned, sector 2, size 1024, block size 65536
EXT4-fs (dm-0): unable to read superblock
This patch changes the logical block size from unsigned short to unsigned
int to avoid the overflow.
Cc: stable@vger.kernel.org
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 3b7995a98a ]
When I doing fuzzy test, get the memleak report:
BUG: memory leak
unreferenced object 0xffff88837af80000 (size 4096):
comm "memleak", pid 3557, jiffies 4294817681 (age 112.499s)
hex dump (first 32 bytes):
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
20 00 00 00 10 01 00 00 00 00 00 00 01 00 00 00 ...............
backtrace:
[<000000001c894df8>] bio_alloc_bioset+0x393/0x590
[<000000008b139a3c>] bio_copy_user_iov+0x300/0xcd0
[<00000000a998bd8c>] blk_rq_map_user_iov+0x2f1/0x5f0
[<000000005ceb7f05>] blk_rq_map_user+0xf2/0x160
[<000000006454da92>] sg_common_write.isra.21+0x1094/0x1870
[<00000000064bb208>] sg_write.part.25+0x5d9/0x950
[<000000004fc670f6>] sg_write+0x5f/0x8c
[<00000000b0d05c7b>] __vfs_write+0x7c/0x100
[<000000008e177714>] vfs_write+0x1c3/0x500
[<0000000087d23f34>] ksys_write+0xf9/0x200
[<000000002c8dbc9d>] do_syscall_64+0x9f/0x4f0
[<00000000678d8e9a>] entry_SYSCALL_64_after_hwframe+0x49/0xbe
If __blk_rq_map_user_iov() is failed in blk_rq_map_user_iov(),
the bio(s) which is allocated before this failing will leak. The
refcount of the bio(s) is init to 1 and increased to 2 by calling
bio_get(), but __blk_rq_unmap_user() only decrease it to 1, so
the bio cannot be freed. Fix it by calling blk_rq_unmap_user().
Reviewed-by: Bob Liu <bob.liu@oracle.com>
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit b2c0fcd287 upstream.
These were added to blkdev_ioctl() in linux-5.5 but not
blkdev_compat_ioctl, so add them now.
Cc: <stable@vger.kernel.org> # v4.4+
Fixes: bbd3e06436 ("block: add an API for Persistent Reservations")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Fold in followup patch from Arnd with missing pr.h header include.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
commit d2c9be89f8 upstream.
8962842ca5 ("blk-mq: avoid sysfs buffer overflow with too many CPU cores")
avoids sysfs buffer overflow, and reserves one character for line break.
However, the last snprintf() doesn't get correct 'size' parameter passed
in, so fixed it.
Fixes: 8962842ca5 ("blk-mq: avoid sysfs buffer overflow with too many CPU cores")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Cc: Nobuhiro Iwamatsu <nobuhiro1.iwamatsu@toshiba.co.jp>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 8962842ca5 upstream.
It is reported that sysfs buffer overflow can be triggered if the system
has too many CPU cores(>841 on 4K PAGE_SIZE) when showing CPUs of
hctx via /sys/block/$DEV/mq/$N/cpu_list.
Use snprintf to avoid the potential buffer overflow.
This version doesn't change the attribute format, and simply stops
showing CPU numbers if the buffer is going to overflow.
Cc: stable@vger.kernel.org
Fixes: 676141e48af7("blk-mq: don't dump CPU -> hw queue map on driver load")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 54648cf1ec upstream.
We find the memory use-after-free issue in __blk_drain_queue()
on the kernel 4.14. After read the latest kernel 4.18-rc6 we
think it has the same problem.
Memory is allocated for q->fq in the blk_init_allocated_queue().
If the elevator init function called with error return, it will
run into the fail case to free the q->fq.
Then the __blk_drain_queue() uses the same memory after the free
of the q->fq, it will lead to the unpredictable event.
The patch is to set q->fq as NULL in the fail case of
blk_init_allocated_queue().
Fixes: commit 7c94e1c157 ("block: introduce blk_flush_queue to drive flush machinery")
Cc: <stable@vger.kernel.org>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: xiao jin <jin.xiao@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
[groeck: backport to v4.4.y/v4.9.y (context change)]
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Alessio Balsini <balsini@android.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit b88aef36b8 ]
If __blkdev_issue_discard is in progress and a device mapper device is
reloaded with a table that doesn't support discard,
q->limits.max_discard_sectors is set to zero. This results in infinite
loop in __blkdev_issue_discard.
This patch checks if max_discard_sectors is zero and aborts with
-EOPNOTSUPP.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Tested-by: Zdenek Kabelac <mpatocka@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit af097f5d19 ]
Don't build discards bigger than what the user asked for, if the
user decided to limit the size by writing to 'discard_max_bytes'.
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
PD#SWPL-2719
Problem:
Can't idetify correctlly when move disk have multi
partition
Solution:
Remove the function of using the partition name as
the device node name
Verify:
test pass on tl1 ref board
Change-Id: I113e63f209c529149fb94b0bb10b0b254717c2bf
Signed-off-by: Ruixuan Li <ruixuan.li@amlogic.com>
[ Upstream commit 14cb2c8a6c ]
The if-block that sets a successful return value in aix_partition()
uses 'lvip[].pps_per_lv' and 'n[].name' potentially uninitialized.
For example, if 'numlvs' is zero or alloc_lvn() fails, neither is
initialized, but are used anyway if alloc_pvd() succeeds after it.
So, make the alloc_pvd() call conditional on their initialization.
This has been hit when attaching an apparently corrupted/stressed
AIX LUN, misleading the kernel to pr_warn() invalid data and hang.
[...] partition (null) (11 pp's found) is not contiguous
[...] partition (null) (2 pp's found) is not contiguous
[...] partition (null) (3 pp's found) is not contiguous
[...] partition (null) (64 pp's found) is not contiguous
Fixes: 6ceea22bbb ("partitions: add aix lvm partition support files")
Signed-off-by: Mauricio Faria de Oliveira <mfo@canonical.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit d43fdae7ba ]
Even if properly initialized, the lvname array (i.e., strings)
is read from disk, and might contain corrupt data (e.g., lack
the null terminating character for strings).
So, make sure the partition name string used in pr_warn() has
the null terminating character.
Fixes: 6ceea22bbb ("partitions: add aix lvm partition support files")
Suggested-by: Daniel J. Axtens <daniel.axtens@canonical.com>
Signed-off-by: Mauricio Faria de Oliveira <mfo@canonical.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit e00f4f4d0f upstream.
blkcg allocates some per-cgroup data structures with GFP_NOWAIT and
when that fails falls back to operations which aren't specific to the
cgroup. Occassional failures are expected under pressure and falling
back to non-cgroup operation is the right thing to do.
Unfortunately, I forgot to add __GFP_NOWARN to these allocations and
these expected failures end up creating a lot of noise. Add
__GFP_NOWARN.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Marc MERLIN <marc@merlins.org>
Reported-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
Signed-off-by: Amit Pundir <amit.pundir@linaro.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit b3193bc0dc upstream.
In below scenario blkio cgroup does not work as per their assigned
weights :-
1. When the underlying device is nonrotational with a single HW queue
with depth of >= CFQ_HW_QUEUE_MIN
2. When the use case is forming two blkio cgroups cg1(weight 1000) &
cg2(wight 100) and two processes(file1 and file2) doing sync IO in
their respective blkio cgroups.
For above usecase result of fio (without this patch):-
file1: (groupid=0, jobs=1): err= 0: pid=685: Thu Jan 1 19:41:49 1970
write: IOPS=1315, BW=41.1MiB/s (43.1MB/s)(1024MiB/24906msec)
<...>
file2: (groupid=0, jobs=1): err= 0: pid=686: Thu Jan 1 19:41:49 1970
write: IOPS=1295, BW=40.5MiB/s (42.5MB/s)(1024MiB/25293msec)
<...>
// both the process BW is equal even though they belong to diff.
cgroups with weight of 1000(cg1) and 100(cg2)
In above case (for non rotational NCQ devices),
as soon as the request from cg1 is completed and even
though it is provided with higher set_slice=10, because of CFQ
algorithm when the driver tries to fetch the request, CFQ expires
this group without providing any idle time nor weight priority
and schedules another cfq group (in this case cg2).
And thus both cfq groups(cg1 & cg2) keep alternating to get the
disk time and hence loses the cgroup weight based scheduling.
Below patch gives a chance to cfq algorithm (cfq_arm_slice_timer)
to arm the slice timer in case group_idle is enabled.
In case if group_idle is also not required (including for nonrotational
NCQ drives), we need to explicitly set group_idle = 0 from sysfs for
such cases.
With this patch result of fio(for above usecase) :-
file1: (groupid=0, jobs=1): err= 0: pid=690: Thu Jan 1 00:06:08 1970
write: IOPS=1706, BW=53.3MiB/s (55.9MB/s)(1024MiB/19197msec)
<..>
file2: (groupid=0, jobs=1): err= 0: pid=691: Thu Jan 1 00:06:08 1970
write: IOPS=1043, BW=32.6MiB/s (34.2MB/s)(1024MiB/31401msec)
<..>
// In this processes BW is as per their respective cgroups weight.
Signed-off-by: Ritesh Harjani <riteshh@codeaurora.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Amit Pundir <amit.pundir@linaro.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit d6c02a9beb ]
In commit ed996a52c8 ("block: simplify and cleanup bvec pool
handling"), the value of the slab index is incremented by one in
bvec_alloc() after the allocation is done to indicate an index value of
0 does not need to be later freed.
bvec_nr_vecs() was not updated accordingly, and thus returns the wrong
value. Decrement idx before performing the lookup.
Fixes: ed996a52c8 ("block: simplify and cleanup bvec pool handling")
Signed-off-by: Greg Edwards <gedwards@ddn.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>