Commit Graph

4708 Commits

Author SHA1 Message Date
Jeffle Xu
f72f8c5b73 dm table: fix DAX iterate_devices based device capability checks
commit 5b0fab5089 upstream.

Fix dm_table_supports_dax() and invert logic of both
iterate_devices_callout_fn so that all devices' DAX capabilities are
properly checked.

Fixes: 545ed20e6d ("dm: add infrastructure for DAX support")
Cc: stable@vger.kernel.org
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
[jeffle: no dax write cache, no dax synchronous]
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-05-16 10:46:04 +09:00
Jeffle Xu
271e19b246 dm table: fix iterate_devices based device capability checks
commit a4c8dd9c2d upstream.

According to the definition of dm_iterate_devices_fn:
 * This function must iterate through each section of device used by the
 * target until it encounters a non-zero return code, which it then returns.
 * Returns zero if no callout returned non-zero.

For some target type (e.g. dm-stripe), one call of iterate_devices() may
iterate multiple underlying devices internally, in which case a non-zero
return code returned by iterate_devices_callout_fn will stop the iteration
in advance. No iterate_devices_callout_fn should return non-zero unless
device iteration should stop.

Rename dm_table_requires_stable_pages() to dm_table_any_dev_attr() and
elevate it for reuse to stop iterating (and return non-zero) on the
first device that causes iterate_devices_callout_fn to return non-zero.
Use dm_table_any_dev_attr() to properly iterate through devices.

Rename device_is_nonrot() to device_is_rotational() and invert logic
accordingly to fix improper disposition.

[jeffle: backport notes]
No stable writes. Also convert the no_sg_merge capability check, which
is introduced by commit 200612ec33 ("dm table: propagate
QUEUE_FLAG_NO_SG_MERGE"), and removed since commit 2705c93742 ("block:
kill QUEUE_FLAG_NO_SG_MERGE") in v5.1.

Fixes: c3c4555edd ("dm table: clear add_random unless all devices have it set")
Fixes: 4693c9668f ("dm table: propagate non rotational flag")
Cc: stable@vger.kernel.org
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-05-16 10:46:03 +09:00
Nikos Tsironis
f680c4d88f dm era: Update in-core bitset after committing the metadata
commit 2099b145d7 upstream.

In case of a system crash, dm-era might fail to mark blocks as written
in its metadata, although the corresponding writes to these blocks were
passed down to the origin device and completed successfully.

Consider the following sequence of events:

1. We write to a block that has not been yet written in the current era
2. era_map() checks the in-core bitmap for the current era and sees
   that the block is not marked as written.
3. The write is deferred for submission after the metadata have been
   updated and committed.
4. The worker thread processes the deferred write
   (process_deferred_bios()) and marks the block as written in the
   in-core bitmap, **before** committing the metadata.
5. The worker thread starts committing the metadata.
6. We do more writes that map to the same block as the write of step (1)
7. era_map() checks the in-core bitmap and sees that the block is marked
   as written, **although the metadata have not been committed yet**.
8. These writes are passed down to the origin device immediately and the
   device reports them as completed.
9. The system crashes, e.g., power failure, before the commit from step
   (5) finishes.

When the system recovers and we query the dm-era target for the list of
written blocks it doesn't report the aforementioned block as written,
although the writes of step (6) completed successfully.

The issue is that era_map() decides whether to defer or not a write
based on non committed information. The root cause of the bug is that we
update the in-core bitmap, **before** committing the metadata.

Fix this by updating the in-core bitmap **after** successfully
committing the metadata.

Fixes: eec40579d8 ("dm: add era target")
Cc: stable@vger.kernel.org # v3.15+
Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-05-16 10:39:15 +09:00
Nikos Tsironis
32317e9971 dm era: only resize metadata in preresume
commit cca2c6aebe upstream.

Metadata resize shouldn't happen in the ctr. The ctr loads a temporary
(inactive) table that will only become active upon resume. That is why
resize should always be done in terms of resume. Otherwise a load (ctr)
whose inactive table never becomes active will incorrectly resize the
metadata.

Also, perform the resize directly in preresume, instead of using the
worker to do it.

The worker might run other metadata operations, e.g., it could start
digestion, before resizing the metadata. These operations will end up
using the old size.

This could lead to errors, like:

  device-mapper: era: metadata_digest_transcribe_writeset: dm_array_set_value failed
  device-mapper: era: process_old_eras: digest step failed, stopping digestion

The reason of the above error is that the worker started the digestion
of the archived writeset using the old, larger size.

As a result, metadata_digest_transcribe_writeset tried to write beyond
the end of the era array.

Fixes: eec40579d8 ("dm: add era target")
Cc: stable@vger.kernel.org # v3.15+
Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-05-16 10:39:04 +09:00
Nikos Tsironis
35823f1ee5 dm era: Reinitialize bitset cache before digesting a new writeset
commit 2524933307 upstream.

In case of devices with at most 64 blocks, the digestion of consecutive
eras uses the writeset of the first era as the writeset of all eras to
digest, leading to lost writes. That is, we lose the information about
what blocks were written during the affected eras.

The digestion code uses a dm_disk_bitset object to access the archived
writesets. This structure includes a one word (64-bit) cache to reduce
the number of array lookups.

This structure is initialized only once, in metadata_digest_start(),
when we kick off digestion.

But, when we insert a new writeset into the writeset tree, before the
digestion of the previous writeset is done, or equivalently when there
are multiple writesets in the writeset tree to digest, then all these
writesets are digested using the same cache and the cache is not
re-initialized when moving from one writeset to the next.

For devices with more than 64 blocks, i.e., the size of the cache, the
cache is indirectly invalidated when we move to a next set of blocks, so
we avoid the bug.

But for devices with at most 64 blocks we end up using the same cached
data for digesting all archived writesets, i.e., the cache is loaded
when digesting the first writeset and it never gets reloaded, until the
digestion is done.

As a result, the writeset of the first era to digest is used as the
writeset of all the following archived eras, leading to lost writes.

Fix this by reinitializing the dm_disk_bitset structure, and thus
invalidating the cache, every time the digestion code starts digesting a
new writeset.

Fixes: eec40579d8 ("dm: add era target")
Cc: stable@vger.kernel.org # v3.15+
Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-05-16 10:39:03 +09:00
Nikos Tsironis
6c3f471eff dm era: Use correct value size in equality function of writeset tree
commit 64f2d15afe upstream.

Fix the writeset tree equality test function to use the right value size
when comparing two btree values.

Fixes: eec40579d8 ("dm: add era target")
Cc: stable@vger.kernel.org # v3.15+
Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
Reviewed-by: Ming-Hung Tsai <mtsai@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-05-16 10:39:01 +09:00
Nikos Tsironis
5e5536f00e dm era: Fix bitset memory leaks
commit 904e6b2666 upstream.

Deallocate the memory allocated for the in-core bitsets when destroying
the target and in error paths.

Fixes: eec40579d8 ("dm: add era target")
Cc: stable@vger.kernel.org # v3.15+
Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
Reviewed-by: Ming-Hung Tsai <mtsai@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-05-16 10:39:00 +09:00
Nikos Tsironis
282ed5dc7d dm era: Verify the data block size hasn't changed
commit c8e846ff93 upstream.

dm-era doesn't support changing the data block size of existing devices,
so check explicitly that the requested block size for a new target
matches the one stored in the metadata.

Fixes: eec40579d8 ("dm: add era target")
Cc: stable@vger.kernel.org # v3.15+
Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
Reviewed-by: Ming-Hung Tsai <mtsai@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-05-16 10:38:59 +09:00
Nikos Tsironis
b5601cf554 dm era: Recover committed writeset after crash
commit de89afc1e4 upstream.

Following a system crash, dm-era fails to recover the committed writeset
for the current era, leading to lost writes. That is, we lose the
information about what blocks were written during the affected era.

dm-era assumes that the writeset of the current era is archived when the
device is suspended. So, when resuming the device, it just moves on to
the next era, ignoring the committed writeset.

This assumption holds when the device is properly shut down. But, when
the system crashes, the code that suspends the target never runs, so the
writeset for the current era is not archived.

There are three issues that cause the committed writeset to get lost:

1. dm-era doesn't load the committed writeset when opening the metadata
2. The code that resizes the metadata wipes the information about the
   committed writeset (assuming it was loaded at step 1)
3. era_preresume() starts a new era, without taking into account that
   the current era might not have been archived, due to a system crash.

To fix this:

1. Load the committed writeset when opening the metadata
2. Fix the code that resizes the metadata to make sure it doesn't wipe
   the loaded writeset
3. Fix era_preresume() to check for a loaded writeset and archive it,
   before starting a new era.

Fixes: eec40579d8 ("dm: add era target")
Cc: stable@vger.kernel.org # v3.15+
Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-05-16 10:38:58 +09:00
Hannes Reinecke
82f0f32cd3 dm: avoid filesystem lookup in dm_get_dev_t()
commit 809b1e4945 upstream.

This reverts commit
644bda6f34 ("dm table: fall back to getting device using name_to_dev_t()")

dm_get_dev_t() is just used to convert an arbitrary 'path' string
into a dev_t. It doesn't presume that the device is present; that
check will be done later, as the only caller is dm_get_device(),
which does a dm_get_table_device() later on, which will properly
open the device.

So if the path string already _is_ in major:minor representation
we can convert it directly, avoiding a recursion into the filesystem
to lookup the block device.

This avoids a hang in multipath_message() when the filesystem is
inaccessible.

Fixes: 644bda6f34 ("dm table: fall back to getting device using name_to_dev_t()")
Cc: stable@vger.kernel.org
Signed-off-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Martin Wilck <mwilck@suse.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-05-16 10:32:30 +09:00
Qinglang Miao
70347ce213 dm ioctl: fix error return code in target_message
[ Upstream commit 4d7659bfbe ]

Fix to return a negative error code from the error handling
case instead of 0, as done elsewhere in this function.

Fixes: 2ca4c92f58 ("dm ioctl: prevent empty message")
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Qinglang Miao <miaoqinglang@huawei.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-05-16 09:57:44 +09:00
Dae R. Jeong
4deca7507f md: fix a warning caused by a race between concurrent md_ioctl()s
commit c731b84b51 upstream.

Syzkaller reports a warning as belows.
WARNING: CPU: 0 PID: 9647 at drivers/md/md.c:7169
...
Call Trace:
...
RIP: 0010:md_ioctl+0x4017/0x5980 drivers/md/md.c:7169
RSP: 0018:ffff888096027950 EFLAGS: 00010293
RAX: ffff88809322c380 RBX: 0000000000000932 RCX: ffffffff84e266f2
RDX: 0000000000000000 RSI: ffffffff84e299f7 RDI: 0000000000000007
RBP: ffff888096027bc0 R08: ffff88809322c380 R09: ffffed101341a482
R10: ffff888096027940 R11: ffff88809a0d240f R12: 0000000000000932
R13: ffff8880a2c14100 R14: ffff88809a0d2268 R15: ffff88809a0d2408
 __blkdev_driver_ioctl block/ioctl.c:304 [inline]
 blkdev_ioctl+0xece/0x1c10 block/ioctl.c:606
 block_ioctl+0xee/0x130 fs/block_dev.c:1930
 vfs_ioctl fs/ioctl.c:46 [inline]
 file_ioctl fs/ioctl.c:509 [inline]
 do_vfs_ioctl+0xd5f/0x1380 fs/ioctl.c:696
 ksys_ioctl+0xab/0xd0 fs/ioctl.c:713
 __do_sys_ioctl fs/ioctl.c:720 [inline]
 __se_sys_ioctl fs/ioctl.c:718 [inline]
 __x64_sys_ioctl+0x73/0xb0 fs/ioctl.c:718
 do_syscall_64+0xfd/0x680 arch/x86/entry/common.c:301
 entry_SYSCALL_64_after_hwframe+0x49/0xbe

This is caused by a race between two concurrenct md_ioctl()s closing
the array.
CPU1 (md_ioctl())                   CPU2 (md_ioctl())
------                              ------
set_bit(MD_CLOSING, &mddev->flags);
did_set_md_closing = true;
                                    WARN_ON_ONCE(test_bit(MD_CLOSING,
                                            &mddev->flags));
if(did_set_md_closing)
    clear_bit(MD_CLOSING, &mddev->flags);

Fix the warning by returning immediately if the MD_CLOSING bit is set
in &mddev->flags which indicates that the array is being closed.

Fixes: 065e519e71 ("md: MD_CLOSING needs to be cleared after called md_set_readonly or do_md_stop")
Reported-by: syzbot+1e46a0864c1a6e9bd3d8@syzkaller.appspotmail.com
Cc: stable@vger.kernel.org
Signed-off-by: Dae R. Jeong <dae.r.jeong@kaist.ac.kr>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-05-16 09:56:39 +09:00
Thomas Gleixner
c53b301df0 dm table: Remove BUG_ON(in_interrupt())
[ Upstream commit e7b624183d ]

The BUG_ON(in_interrupt()) in dm_table_event() is a historic leftover from
a rework of the dm table code which changed the calling context.

Issuing a BUG for a wrong calling context is frowned upon and
in_interrupt() is deprecated and only covering parts of the wrong
contexts. The sanity check for the context is covered by
CONFIG_DEBUG_ATOMIC_SLEEP and other debug facilities already.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-05-16 09:56:21 +09:00
Song Liu
6b55908db3 md/raid5: fix oops during stripe resizing
commit b44c018cdf upstream.

KoWei reported crash during raid5 reshape:

[ 1032.252932] Oops: 0002 [#1] SMP PTI
[...]
[ 1032.252943] RIP: 0010:memcpy_erms+0x6/0x10
[...]
[ 1032.252947] RSP: 0018:ffffba1ac0c03b78 EFLAGS: 00010286
[ 1032.252949] RAX: 0000784ac0000000 RBX: ffff91bec3d09740 RCX: 0000000000001000
[ 1032.252951] RDX: 0000000000001000 RSI: ffff91be6781c000 RDI: 0000784ac0000000
[ 1032.252953] RBP: ffffba1ac0c03bd8 R08: 0000000000001000 R09: ffffba1ac0c03bf8
[ 1032.252954] R10: 0000000000000000 R11: 0000000000000000 R12: ffffba1ac0c03bf8
[ 1032.252955] R13: 0000000000001000 R14: 0000000000000000 R15: 0000000000000000
[ 1032.252958] FS:  0000000000000000(0000) GS:ffff91becf500000(0000) knlGS:0000000000000000
[ 1032.252959] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1032.252961] CR2: 0000784ac0000000 CR3: 000000031780a002 CR4: 00000000001606e0
[ 1032.252962] Call Trace:
[ 1032.252969]  ? async_memcpy+0x179/0x1000 [async_memcpy]
[ 1032.252977]  ? raid5_release_stripe+0x8e/0x110 [raid456]
[ 1032.252982]  handle_stripe_expansion+0x15a/0x1f0 [raid456]
[ 1032.252988]  handle_stripe+0x592/0x1270 [raid456]
[ 1032.252993]  handle_active_stripes.isra.0+0x3cb/0x5a0 [raid456]
[ 1032.252999]  raid5d+0x35c/0x550 [raid456]
[ 1032.253002]  ? schedule+0x42/0xb0
[ 1032.253006]  ? schedule_timeout+0x10e/0x160
[ 1032.253011]  md_thread+0x97/0x160
[ 1032.253015]  ? wait_woken+0x80/0x80
[ 1032.253019]  kthread+0x104/0x140
[ 1032.253022]  ? md_start_sync+0x60/0x60
[ 1032.253024]  ? kthread_park+0x90/0x90
[ 1032.253027]  ret_from_fork+0x35/0x40

This is because cache_size_mutex was unlocked too early in resize_stripes,
which races with grow_one_stripe() that grow_one_stripe() allocates a
stripe with wrong pool_size.

Fix this issue by unlocking cache_size_mutex after updating pool_size.

Cc: <stable@vger.kernel.org> # v4.4+
Reported-by: KoWei Sung <winders@amazon.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-05-16 09:45:03 +09:00
Zhao Heming
f1d38c4359 md/bitmap: md_bitmap_get_counter returns wrong blocks
[ Upstream commit d837f7277f ]

md_bitmap_get_counter() has code:

```
    if (bitmap->bp[page].hijacked ||
        bitmap->bp[page].map == NULL)
        csize = ((sector_t)1) << (bitmap->chunkshift +
                      PAGE_COUNTER_SHIFT - 1);
```

The minus 1 is wrong, this branch should report 2048 bits of space.
With "-1" action, this only report 1024 bit of space.

This bug code returns wrong blocks, but it doesn't inflence bitmap logic:
1. Most callers focus this function return value (the counter of offset),
   not the parameter blocks.
2. The bug is only triggered when hijacked is true or map is NULL.
   the hijacked true condition is very rare.
   the "map == null" only true when array is creating or resizing.
3. Even the caller gets wrong blocks, current code makes caller just to
   call md_bitmap_get_counter() one more time.

Signed-off-by: Zhao Heming <heming.zhao@suse.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-05-16 09:44:51 +09:00
Guoju Fang
e6ede97e26 bcache: fix a lost wake-up problem caused by mca_cannibalize_lock
[ Upstream commit 34cf78bf34 ]

This patch fix a lost wake-up problem caused by the race between
mca_cannibalize_lock and bch_cannibalize_unlock.

Consider two processes, A and B. Process A is executing
mca_cannibalize_lock, while process B takes c->btree_cache_alloc_lock
and is executing bch_cannibalize_unlock. The problem happens that after
process A executes cmpxchg and will execute prepare_to_wait. In this
timeslice process B executes wake_up, but after that process A executes
prepare_to_wait and set the state to TASK_INTERRUPTIBLE. Then process A
goes to sleep but no one will wake up it. This problem may cause bcache
device to dead.

Signed-off-by: Guoju Fang <fangguoju@gmail.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-05-16 09:15:02 +09:00
Ye Bin
7b20541f30 dm thin metadata: Avoid returning cmd->bm wild pointer on error
commit 219403d7e5 upstream.

Maybe __create_persistent_data_objects() caller will use PTR_ERR as a
pointer, it will lead to some strange things.

Signed-off-by: Ye Bin <yebin10@huawei.com>
Cc: stable@vger.kernel.org
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-05-16 09:05:01 +09:00
Ye Bin
8f98e2686a dm cache metadata: Avoid returning cmd->bm wild pointer on error
commit d16ff19e69 upstream.

Maybe __create_persistent_data_objects() caller will use PTR_ERR as a
pointer, it will lead to some strange things.

Signed-off-by: Ye Bin <yebin10@huawei.com>
Cc: stable@vger.kernel.org
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-05-16 09:05:00 +09:00
Kees Cook
94dc0555b8 overflow.h: Add allocation size calculation helpers
commit 610b15c50e upstream.

In preparation for replacing unchecked overflows for memory allocations,
this creates helpers for the 3 most common calculations:

array_size(a, b): 2-dimensional array
array3_size(a, b, c): 3-dimensional array
struct_size(ptr, member, n): struct followed by n-many trailing members

Each of these return SIZE_MAX on overflow instead of wrapping around.

(Additionally renames a variable named "array_size" to avoid future
collision.)

Co-developed-by: Matthew Wilcox <mawilcox@microsoft.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-05-16 08:57:52 +09:00
Coly Li
f155eb0fa3 bcache: allocate meta data pages as compound pages
commit 5fe4886785 upstream.

There are some meta data of bcache are allocated by multiple pages,
and they are used as bio bv_page for I/Os to the cache device. for
example cache_set->uuids, cache->disk_buckets, journal_write->data,
bset_tree->data.

For such meta data memory, all the allocated pages should be treated
as a single memory block. Then the memory management and underlying I/O
code can treat them more clearly.

This patch adds __GFP_COMP flag to all the location allocating >0 order
pages for the above mentioned meta data. Then their pages are treated
as compound pages now.

Signed-off-by: Coly Li <colyli@suse.de>
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-05-16 08:48:03 +09:00
ChangSyun Peng
f09cb5fe1a md/raid5: Fix Force reconstruct-write io stuck in degraded raid5
commit a1c6ae3d9f upstream.

In degraded raid5, we need to read parity to do reconstruct-write when
data disks fail. However, we can not read parity from
handle_stripe_dirtying() in force reconstruct-write mode.

Reproducible Steps:

1. Create degraded raid5
mdadm -C /dev/md2 --assume-clean -l5 -n3 /dev/sda2 /dev/sdb2 missing
2. Set rmw_level to 0
echo 0 > /sys/block/md2/md/rmw_level
3. IO to raid5

Now some io may be stuck in raid5. We can use handle_stripe_fill() to read
the parity in this situation.

Cc: <stable@vger.kernel.org> # v4.4+
Reviewed-by: Alex Wu <alexwu@synology.com>
Reviewed-by: BingJing Chang <bingjingc@synology.com>
Reviewed-by: Danny Shih <dannyshih@synology.com>
Signed-off-by: ChangSyun Peng <allenpeng@synology.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-05-16 08:48:02 +09:00
Coly Li
58dcc2cb50 bcache: fix super block seq numbers comparision in register_cache_set()
[ Upstream commit 117f636ea6 ]

In register_cache_set(), c is pointer to struct cache_set, and ca is
pointer to struct cache, if ca->sb.seq > c->sb.seq, it means this
registering cache has up to date version and other members, the in-
memory version and other members should be updated to the newer value.

But current implementation makes a cache set only has a single cache
device, so the above assumption works well except for a special case.
The execption is when a cache device new created and both ca->sb.seq and
c->sb.seq are 0, because the super block is never flushed out yet. In
the location for the following if() check,
2156         if (ca->sb.seq > c->sb.seq) {
2157                 c->sb.version           = ca->sb.version;
2158                 memcpy(c->sb.set_uuid, ca->sb.set_uuid, 16);
2159                 c->sb.flags             = ca->sb.flags;
2160                 c->sb.seq               = ca->sb.seq;
2161                 pr_debug("set version = %llu\n", c->sb.version);
2162         }
c->sb.version is not initialized yet and valued 0. When ca->sb.seq is 0,
the if() check will fail (because both values are 0), and the cache set
version, set_uuid, flags and seq won't be updated.

The above problem is hiden for current code, because the bucket size is
compatible among different super block version. And the next time when
running cache set again, ca->sb.seq will be larger than 0 and cache set
super block version will be updated properly.

But if the large bucket feature is enabled,  sb->bucket_size is the low
16bits of the bucket size. For a power of 2 value, when the actual
bucket size exceeds 16bit width, sb->bucket_size will always be 0. Then
read_super_common() will fail because the if() check to
is_power_of_2(sb->bucket_size) is false. This is how the long time
hidden bug is triggered.

This patch modifies the if() check to the following way,
2156         if (ca->sb.seq > c->sb.seq || c->sb.seq == 0) {
Then cache set's version, set_uuid, flags and seq will always be updated
corectly including for a new created cache device.

Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-05-16 08:46:08 +09:00
Zhao Heming
75b14ba657 md-cluster: fix wild pointer of unlock_all_bitmaps()
[ Upstream commit 60f80d6f2d ]

reproduction steps:
```
node1 # mdadm -C /dev/md0 -b clustered -e 1.2 -n 2 -l mirror /dev/sda
/dev/sdb
node2 # mdadm -A /dev/md0 /dev/sda /dev/sdb
node1 # mdadm -G /dev/md0 -b none
mdadm: failed to remove clustered bitmap.
node1 # mdadm -S --scan
^C  <==== mdadm hung & kernel crash
```

kernel stack:
```
[  335.230657] general protection fault: 0000 [#1] SMP NOPTI
[...]
[  335.230848] Call Trace:
[  335.230873]  ? unlock_all_bitmaps+0x5/0x70 [md_cluster]
[  335.230886]  unlock_all_bitmaps+0x3d/0x70 [md_cluster]
[  335.230899]  leave+0x10f/0x190 [md_cluster]
[  335.230932]  ? md_super_wait+0x93/0xa0 [md_mod]
[  335.230947]  ? leave+0x5/0x190 [md_cluster]
[  335.230973]  md_cluster_stop+0x1a/0x30 [md_mod]
[  335.230999]  md_bitmap_free+0x142/0x150 [md_mod]
[  335.231013]  ? _cond_resched+0x15/0x40
[  335.231025]  ? mutex_lock+0xe/0x30
[  335.231056]  __md_stop+0x1c/0xa0 [md_mod]
[  335.231083]  do_md_stop+0x160/0x580 [md_mod]
[  335.231119]  ? 0xffffffffc05fb078
[  335.231148]  md_ioctl+0xa04/0x1930 [md_mod]
[  335.231165]  ? filename_lookup+0xf2/0x190
[  335.231179]  blkdev_ioctl+0x93c/0xa10
[  335.231205]  ? _cond_resched+0x15/0x40
[  335.231214]  ? __check_object_size+0xd4/0x1a0
[  335.231224]  block_ioctl+0x39/0x40
[  335.231243]  do_vfs_ioctl+0xa0/0x680
[  335.231253]  ksys_ioctl+0x70/0x80
[  335.231261]  __x64_sys_ioctl+0x16/0x20
[  335.231271]  do_syscall_64+0x65/0x1f0
[  335.231278]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
```

Signed-off-by: Zhao Heming <heming.zhao@suse.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-05-16 08:45:55 +09:00
Zhiqiang Liu
6cd0db9f72 bcache: fix potential deadlock problem in btree_gc_coalesce
[ Upstream commit be23e83733 ]

coccicheck reports:
  drivers/md//bcache/btree.c:1538:1-7: preceding lock on line 1417

In btree_gc_coalesce func, if the coalescing process fails, we will goto
to out_nocoalesce tag directly without releasing new_nodes[i]->write_lock.
Then, it will cause a deadlock when trying to acquire new_nodes[i]->
write_lock for freeing new_nodes[i] before return.

btree_gc_coalesce func details as follows:
	if alloc new_nodes[i] fails:
		goto out_nocoalesce;
	// obtain new_nodes[i]->write_lock
	mutex_lock(&new_nodes[i]->write_lock)
	// main coalescing process
	for (i = nodes - 1; i > 0; --i)
		[snipped]
		if coalescing process fails:
			// Here, directly goto out_nocoalesce
			 // tag will cause a deadlock
			goto out_nocoalesce;
		[snipped]
	// release new_nodes[i]->write_lock
	mutex_unlock(&new_nodes[i]->write_lock)
	// coalesing succ, return
	return;
out_nocoalesce:
	btree_node_free(new_nodes[i])	// free new_nodes[i]
	// obtain new_nodes[i]->write_lock
	mutex_lock(&new_nodes[i]->write_lock);
	// set flag for reuse
	clear_bit(BTREE_NODE_dirty, &ew_nodes[i]->flags);
	// release new_nodes[i]->write_lock
	mutex_unlock(&new_nodes[i]->write_lock);

To fix the problem, we add a new tag 'out_unlock_nocoalesce' for
releasing new_nodes[i]->write_lock before out_nocoalesce tag. If
coalescing process fails, we will go to out_unlock_nocoalesce tag
for releasing new_nodes[i]->write_lock before free new_nodes[i] in
out_nocoalesce tag.

(Coly Li helps to clean up commit log format.)

Fixes: 2a285686c1 ("bcache: btree locking rework")
Signed-off-by: Zhiqiang Liu <liuzhiqiang26@huawei.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-05-16 08:20:13 +09:00
Guoqing Jiang
65fcd8b25e md: don't flush workqueue unconditionally in md_open
[ Upstream commit f6766ff6af ]

We need to check mddev->del_work before flush workqueu since the purpose
of flush is to ensure the previous md is disappeared. Otherwise the similar
deadlock appeared if LOCKDEP is enabled, it is due to md_open holds the
bdev->bd_mutex before flush workqueue.

kernel: [  154.522645] ======================================================
kernel: [  154.522647] WARNING: possible circular locking dependency detected
kernel: [  154.522650] 5.6.0-rc7-lp151.27-default #25 Tainted: G           O
kernel: [  154.522651] ------------------------------------------------------
kernel: [  154.522653] mdadm/2482 is trying to acquire lock:
kernel: [  154.522655] ffff888078529128 ((wq_completion)md_misc){+.+.}, at: flush_workqueue+0x84/0x4b0
kernel: [  154.522673]
kernel: [  154.522673] but task is already holding lock:
kernel: [  154.522675] ffff88804efa9338 (&bdev->bd_mutex){+.+.}, at: __blkdev_get+0x79/0x590
kernel: [  154.522691]
kernel: [  154.522691] which lock already depends on the new lock.
kernel: [  154.522691]
kernel: [  154.522694]
kernel: [  154.522694] the existing dependency chain (in reverse order) is:
kernel: [  154.522696]
kernel: [  154.522696] -> #4 (&bdev->bd_mutex){+.+.}:
kernel: [  154.522704]        __mutex_lock+0x87/0x950
kernel: [  154.522706]        __blkdev_get+0x79/0x590
kernel: [  154.522708]        blkdev_get+0x65/0x140
kernel: [  154.522709]        blkdev_get_by_dev+0x2f/0x40
kernel: [  154.522716]        lock_rdev+0x3d/0x90 [md_mod]
kernel: [  154.522719]        md_import_device+0xd6/0x1b0 [md_mod]
kernel: [  154.522723]        new_dev_store+0x15e/0x210 [md_mod]
kernel: [  154.522728]        md_attr_store+0x7a/0xc0 [md_mod]
kernel: [  154.522732]        kernfs_fop_write+0x117/0x1b0
kernel: [  154.522735]        vfs_write+0xad/0x1a0
kernel: [  154.522737]        ksys_write+0xa4/0xe0
kernel: [  154.522745]        do_syscall_64+0x64/0x2b0
kernel: [  154.522748]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
kernel: [  154.522749]
kernel: [  154.522749] -> #3 (&mddev->reconfig_mutex){+.+.}:
kernel: [  154.522752]        __mutex_lock+0x87/0x950
kernel: [  154.522756]        new_dev_store+0xc9/0x210 [md_mod]
kernel: [  154.522759]        md_attr_store+0x7a/0xc0 [md_mod]
kernel: [  154.522761]        kernfs_fop_write+0x117/0x1b0
kernel: [  154.522763]        vfs_write+0xad/0x1a0
kernel: [  154.522765]        ksys_write+0xa4/0xe0
kernel: [  154.522767]        do_syscall_64+0x64/0x2b0
kernel: [  154.522769]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
kernel: [  154.522770]
kernel: [  154.522770] -> #2 (kn->count#253){++++}:
kernel: [  154.522775]        __kernfs_remove+0x253/0x2c0
kernel: [  154.522778]        kernfs_remove+0x1f/0x30
kernel: [  154.522780]        kobject_del+0x28/0x60
kernel: [  154.522783]        mddev_delayed_delete+0x24/0x30 [md_mod]
kernel: [  154.522786]        process_one_work+0x2a7/0x5f0
kernel: [  154.522788]        worker_thread+0x2d/0x3d0
kernel: [  154.522793]        kthread+0x117/0x130
kernel: [  154.522795]        ret_from_fork+0x3a/0x50
kernel: [  154.522796]
kernel: [  154.522796] -> #1 ((work_completion)(&mddev->del_work)){+.+.}:
kernel: [  154.522800]        process_one_work+0x27e/0x5f0
kernel: [  154.522802]        worker_thread+0x2d/0x3d0
kernel: [  154.522804]        kthread+0x117/0x130
kernel: [  154.522806]        ret_from_fork+0x3a/0x50
kernel: [  154.522807]
kernel: [  154.522807] -> #0 ((wq_completion)md_misc){+.+.}:
kernel: [  154.522813]        __lock_acquire+0x1392/0x1690
kernel: [  154.522816]        lock_acquire+0xb4/0x1a0
kernel: [  154.522818]        flush_workqueue+0xab/0x4b0
kernel: [  154.522821]        md_open+0xb6/0xc0 [md_mod]
kernel: [  154.522823]        __blkdev_get+0xea/0x590
kernel: [  154.522825]        blkdev_get+0x65/0x140
kernel: [  154.522828]        do_dentry_open+0x1d1/0x380
kernel: [  154.522831]        path_openat+0x567/0xcc0
kernel: [  154.522834]        do_filp_open+0x9b/0x110
kernel: [  154.522836]        do_sys_openat2+0x201/0x2a0
kernel: [  154.522838]        do_sys_open+0x57/0x80
kernel: [  154.522840]        do_syscall_64+0x64/0x2b0
kernel: [  154.522842]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
kernel: [  154.522844]
kernel: [  154.522844] other info that might help us debug this:
kernel: [  154.522844]
kernel: [  154.522846] Chain exists of:
kernel: [  154.522846]   (wq_completion)md_misc --> &mddev->reconfig_mutex --> &bdev->bd_mutex
kernel: [  154.522846]
kernel: [  154.522850]  Possible unsafe locking scenario:
kernel: [  154.522850]
kernel: [  154.522852]        CPU0                    CPU1
kernel: [  154.522853]        ----                    ----
kernel: [  154.522854]   lock(&bdev->bd_mutex);
kernel: [  154.522856]                                lock(&mddev->reconfig_mutex);
kernel: [  154.522858]                                lock(&bdev->bd_mutex);
kernel: [  154.522860]   lock((wq_completion)md_misc);
kernel: [  154.522861]
kernel: [  154.522861]  *** DEADLOCK ***
kernel: [  154.522861]
kernel: [  154.522864] 1 lock held by mdadm/2482:
kernel: [  154.522865]  #0: ffff88804efa9338 (&bdev->bd_mutex){+.+.}, at: __blkdev_get+0x79/0x590
kernel: [  154.522868]
kernel: [  154.522868] stack backtrace:
kernel: [  154.522873] CPU: 1 PID: 2482 Comm: mdadm Tainted: G           O      5.6.0-rc7-lp151.27-default #25
kernel: [  154.522875] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
kernel: [  154.522878] Call Trace:
kernel: [  154.522881]  dump_stack+0x8f/0xcb
kernel: [  154.522884]  check_noncircular+0x194/0x1b0
kernel: [  154.522888]  ? __lock_acquire+0x1392/0x1690
kernel: [  154.522890]  __lock_acquire+0x1392/0x1690
kernel: [  154.522893]  lock_acquire+0xb4/0x1a0
kernel: [  154.522895]  ? flush_workqueue+0x84/0x4b0
kernel: [  154.522898]  flush_workqueue+0xab/0x4b0
kernel: [  154.522900]  ? flush_workqueue+0x84/0x4b0
kernel: [  154.522905]  ? md_open+0xb6/0xc0 [md_mod]
kernel: [  154.522908]  md_open+0xb6/0xc0 [md_mod]
kernel: [  154.522910]  __blkdev_get+0xea/0x590
kernel: [  154.522912]  ? bd_acquire+0xc0/0xc0
kernel: [  154.522914]  blkdev_get+0x65/0x140
kernel: [  154.522916]  ? bd_acquire+0xc0/0xc0
kernel: [  154.522918]  do_dentry_open+0x1d1/0x380
kernel: [  154.522921]  path_openat+0x567/0xcc0
kernel: [  154.522923]  ? __lock_acquire+0x380/0x1690
kernel: [  154.522926]  do_filp_open+0x9b/0x110
kernel: [  154.522929]  ? __alloc_fd+0xe5/0x1f0
kernel: [  154.522935]  ? kmem_cache_alloc+0x28c/0x630
kernel: [  154.522939]  ? do_sys_openat2+0x201/0x2a0
kernel: [  154.522941]  do_sys_openat2+0x201/0x2a0
kernel: [  154.522944]  do_sys_open+0x57/0x80
kernel: [  154.522946]  do_syscall_64+0x64/0x2b0
kernel: [  154.522948]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
kernel: [  154.522951] RIP: 0033:0x7f98d279d9ae

And md_alloc also flushed the same workqueue, but the thing is different
here. Because all the paths call md_alloc don't hold bdev->bd_mutex, and
the flush is necessary to avoid race condition, so leave it as it is.

Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-05-15 17:33:47 +09:00
Sunwook Eom
af9c7c8736 dm verity fec: fix hash block number in verity_fec_decode
commit ad4e80a639 upstream.

The error correction data is computed as if data and hash blocks
were concatenated. But hash block number starts from v->hash_start.
So, we have to calculate hash block number based on that.

Fixes: a739ff3f54 ("dm verity: add support for forward error correction")
Cc: stable@vger.kernel.org
Signed-off-by: Sunwook Eom <speed.eom@samsung.com>
Reviewed-by: Sami Tolvanen <samitolvanen@google.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-05-15 17:24:43 +09:00
Goldwyn Rodrigues
af6eacaf87 dm flakey: check for null arg_name in parse_features()
[ Upstream commit 7690e25302 ]

One can crash dm-flakey by specifying more feature arguments than the
number of features supplied.  Checking for null in arg_name avoids
this.

dmsetup create flakey-test --table "0 66076080 flakey /dev/sdb9 0 0 180 2 drop_writes"

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-05-15 17:15:29 +09:00
Shetty, Harshini X (EXT-Sony Mobile)
a293c56c1a dm verity fec: fix memory leak in verity_fec_dtr
commit 75fa601934 upstream.

Fix below kmemleak detected in verity_fec_ctr. output_pool is
allocated for each dm-verity-fec device. But it is not freed when
dm-table for the verity target is removed. Hence free the output
mempool in destructor function verity_fec_dtr.

unreferenced object 0xffffffffa574d000 (size 4096):
  comm "init", pid 1667, jiffies 4294894890 (age 307.168s)
  hex dump (first 32 bytes):
    8e 36 00 98 66 a8 0b 9b 00 00 00 00 00 00 00 00  .6..f...........
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  backtrace:
    [<0000000060e82407>] __kmalloc+0x2b4/0x340
    [<00000000dd99488f>] mempool_kmalloc+0x18/0x20
    [<000000002560172b>] mempool_init_node+0x98/0x118
    [<000000006c3574d2>] mempool_init+0x14/0x20
    [<0000000008cb266e>] verity_fec_ctr+0x388/0x3b0
    [<000000000887261b>] verity_ctr+0x87c/0x8d0
    [<000000002b1e1c62>] dm_table_add_target+0x174/0x348
    [<000000002ad89eda>] table_load+0xe4/0x328
    [<000000001f06f5e9>] dm_ctl_ioctl+0x3b4/0x5a0
    [<00000000bee5fbb7>] do_vfs_ioctl+0x5dc/0x928
    [<00000000b475b8f5>] __arm64_sys_ioctl+0x70/0x98
    [<000000005361e2e8>] el0_svc_common+0xa0/0x158
    [<000000001374818f>] el0_svc_handler+0x6c/0x88
    [<000000003364e9f4>] el0_svc+0x8/0xc
    [<000000009d84cec9>] 0xffffffffffffffff

Fixes: a739ff3f54 ("dm verity: add support for forward error correction")
Depends-on: 6f1c819c21 ("dm: convert to bioset_init()/mempool_init()")
Cc: stable@vger.kernel.org
Signed-off-by: Harshini Shetty <harshini.x.shetty@sony.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-05-15 17:14:45 +09:00
Mikulas Patocka
7e86405ab9 dm cache: fix a crash due to incorrect work item cancelling
commit 7cdf6a0aae upstream.

The crash can be reproduced by running the lvm2 testsuite test
lvconvert-thin-external-cache.sh for several minutes, e.g.:
  while :; do make check T=shell/lvconvert-thin-external-cache.sh; done

The crash happens in this call chain:
do_waker -> policy_tick -> smq_tick -> end_hotspot_period -> clear_bitset
-> memset -> __memset -- which accesses an invalid pointer in the vmalloc
area.

The work entry on the workqueue is executed even after the bitmap was
freed. The problem is that cancel_delayed_work doesn't wait for the
running work item to finish, so the work item can continue running and
re-submitting itself even after cache_postsuspend. In order to make sure
that the work item won't be running, we must use cancel_delayed_work_sync.

Also, change flush_workqueue to drain_workqueue, so that if some work item
submits itself or another work item, we are properly waiting for both of
them.

Fixes: c6b4fcbad0 ("dm: add cache target")
Cc: stable@vger.kernel.org # v3.9
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-05-15 17:03:20 +09:00
Coly Li
e94308de2e bcache: explicity type cast in bset_bkey_last()
[ Upstream commit 7c02b0055f ]

In bset.h, macro bset_bkey_last() is defined as,
    bkey_idx((struct bkey *) (i)->d, (i)->keys)

Parameter i can be variable type of data structure, the macro always
works once the type of struct i has member 'd' and 'keys'.

bset_bkey_last() is also used in macro csum_set() to calculate the
checksum of a on-disk data structure. When csum_set() is used to
calculate checksum of on-disk bcache super block, the parameter 'i'
data type is struct cache_sb_disk. Inside struct cache_sb_disk (also in
struct cache_sb) the member keys is __u16 type. But bkey_idx() expects
unsigned int (a 32bit width), so there is problem when sending
parameters via stack to call bkey_idx().

Sparse tool from Intel 0day kbuild system reports this incompatible
problem. bkey_idx() is part of user space API, so the simplest fix is
to cast the (i)->keys to unsigned int type in macro bset_bkey_last().

Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-05-15 16:59:51 +09:00
Mike Snitzer
311e12b110 dm: fix potential for q->make_request_fn NULL pointer
commit 47ace7e012 upstream.

Move blk_queue_make_request() to dm.c:alloc_dev() so that
q->make_request_fn is never NULL during the lifetime of a DM device
(even one that is created without a DM table).

Otherwise generic_make_request() will crash simply by doing:
  dmsetup create -n test
  mount /dev/dm-N /mnt

While at it, move ->congested_data initialization out of
dm.c:alloc_dev() and into the bio-based specific init method.

Reported-by: Stefan Bader <stefan.bader@canonical.com>
BugLink: https://bugs.launchpad.net/bugs/1860231
Fixes: ff36ab3458 ("dm: remove request-based logic from make_request_fn wrapper")
Depends-on: c12c9a3c38 ("dm: various cleanups to md->queue initialization code")
Cc: stable@vger.kernel.org
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
[smb: adjusted for context and dm_init_md_queue() exitsting in older
      kernels, and congested_data embedded in backing_dev_info]
Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-05-15 16:57:22 +09:00
Joe Thornber
e273d7332c dm space map common: fix to ensure new block isn't already in use
commit 4feaef830d upstream.

The space-maps track the reference counts for disk blocks allocated by
both the thin-provisioning and cache targets.  There are variants for
tracking metadata blocks and data blocks.

Transactionality is implemented by never touching blocks from the
previous transaction, so we can rollback in the event of a crash.

When allocating a new block we need to ensure the block is free (has
reference count of 0) in both the current and previous transaction.
Prior to this fix we were doing this by searching for a free block in
the previous transaction, and relying on a 'begin' counter to track
where the last allocation in the current transaction was.  This
'begin' field was not being updated in all code paths (eg, increment
of a data block reference count due to breaking sharing of a neighbour
block in the same btree leaf).

This fix keeps the 'begin' field, but now it's just a hint to speed up
the search.  Instead the current transaction is searched for a free
block, and then the old transaction is double checked to ensure it's
free.  Much simpler.

This fixes reports of sm_disk_new_block()'s BUG_ON() triggering when
DM thin-provisioning's snapshots are heavily used.

Reported-by: Eric Wheeler <dm-devel@lists.ewheeler.net>
Cc: stable@vger.kernel.org
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-05-15 16:55:48 +09:00
Andy Shevchenko
b4969b0cfa md: Avoid namespace collision with bitmap API
commit e64e4018d5 upstream.

bitmap API (include/linux/bitmap.h) has 'bitmap' prefix for its methods.

On the other hand MD bitmap API is special case.
Adding 'md' prefix to it to avoid name space collision.

No functional changes intended.

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Acked-by: Shaohua Li <shli@kernel.org>
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
[only take the bitmap_free change for stable - gregkh]
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-05-15 16:48:18 +09:00
Dan Carpenter
e5393f586d bcache: silence static checker warning
commit da22f0eea5 upstream.

In olden times, closure_return() used to have a hidden return built in.
We removed the hidden return but forgot to add a new return here.  If
"c" were NULL we would oops on the next line, but fortunately "c" is
never NULL.  Let's just remove the if statement.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Lee Jones <lee.jones@linaro.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-05-15 16:48:15 +09:00
Mikulas Patocka
44f168793d block: fix an integer overflow in logical block size
commit ad6bf88a6c upstream.

Logical block size has type unsigned short. That means that it can be at
most 32768. However, there are architectures that can run with 64k pages
(for example arm64) and on these architectures, it may be possible to
create block devices with 64k block size.

For exmaple (run this on an architecture with 64k pages):

Mount will fail with this error because it tries to read the superblock using 2-sector
access:
  device-mapper: writecache: I/O is not aligned, sector 2, size 1024, block size 65536
  EXT4-fs (dm-0): unable to read superblock

This patch changes the logical block size from unsigned short to unsigned
int to avoid the overflow.

Cc: stable@vger.kernel.org
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-05-15 16:34:53 +09:00
Zhiqiang Liu
2e9cc430d5 md: raid1: check rdev before reference in raid1_sync_request func
[ Upstream commit 028288df63 ]

In raid1_sync_request func, rdev should be checked before reference.

Signed-off-by: Zhiqiang Liu <liuzhiqiang26@huawei.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-05-15 16:30:05 +09:00
Coly Li
79861a65de bcache: at least try to shrink 1 node in bch_mca_scan()
[ Upstream commit 9fcc34b1a6 ]

In bch_mca_scan(), the number of shrinking btree node is calculated
by code like this,
	unsigned long nr = sc->nr_to_scan;

        nr /= c->btree_pages;
        nr = min_t(unsigned long, nr, mca_can_free(c));
variable sc->nr_to_scan is number of objects (here is bcache B+tree
nodes' number) to shrink, and pointer variable sc is sent from memory
management code as parametr of a callback.

If sc->nr_to_scan is smaller than c->btree_pages, after the above
calculation, variable 'nr' will be 0 and nothing will be shrunk. It is
frequeently observed that only 1 or 2 is set to sc->nr_to_scan and make
nr to be zero. Then bch_mca_scan() will do nothing more then acquiring
and releasing mutex c->bucket_lock.

This patch checkes whether nr is 0 after the above calculation, if 0
is the result then set 1 to variable 'n'. Then at least bch_mca_scan()
will try to shrink a single B+tree node.

Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-05-15 16:21:14 +09:00
Hou Tao
616e4c8189 dm btree: increase rebalance threshold in __rebalance2()
commit 474e559567 upstream.

We got the following warnings from thin_check during thin-pool setup:

  $ thin_check /dev/vdb
  examining superblock
  examining devices tree
    missing devices: [1, 84]
      too few entries in btree_node: 41, expected at least 42 (block 138, max_entries = 126)
  examining mapping tree

The phenomenon is the number of entries in one node of details_info tree is
less than (max_entries / 3). And it can be easily reproduced by the following
procedures:

  $ new a thin pool
  $ presume the max entries of details_info tree is 126
  $ new 127 thin devices (e.g. 1~127) to make the root node being full
    and then split
  $ remove the first 43 (e.g. 1~43) thin devices to make the children
    reblance repeatedly
  $ stop the thin pool
  $ thin_check

The root cause is that the B-tree removal procedure in __rebalance2()
doesn't guarantee the invariance: the minimal number of entries in
non-root node should be >= (max_entries / 3).

Simply fix the problem by increasing the rebalance threshold to
make sure the number of entries in each child will be greater
than or equal to (max_entries / 3 + 1), so no matter which
child is used for removal, the number will still be valid.

Cc: stable@vger.kernel.org
Signed-off-by: Hou Tao <houtao1@huawei.com>
Acked-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-05-15 16:17:56 +09:00
Sweet Tea
dde3614b04 dm flakey: Properly corrupt multi-page bios.
[ Upstream commit a00f5276e2 ]

The flakey target is documented to be able to corrupt the Nth byte in
a bio, but does not corrupt byte indices after the first biovec in the
bio. Change the corrupting function to actually corrupt the Nth byte
no matter in which biovec that index falls.

A test device generating two-page bios, atop a flakey device configured
to corrupt a byte index on the second page, verified both the failure
to corrupt before this patch and the expected corruption after this
change.

Signed-off-by: John Dorminy <jdorminy@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-05-15 16:06:46 +09:00
Bart Van Assche
7ead68f7b5 dm: use blk_set_queue_dying() in __dm_destroy()
commit 2e91c36941 upstream.

After QUEUE_FLAG_DYING has been set any code that is waiting in
get_request() should be woken up.  But to get this behaviour
blk_set_queue_dying() must be used instead of only setting
QUEUE_FLAG_DYING.

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Lee Jones <lee.jones@linaro.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-05-15 15:19:15 +09:00
Shenghui Wang
c310af4bab bcache: recal cached_dev_sectors on detach
[ Upstream commit 46010141da ]

Recal cached_dev_sectors on cached_dev detached, as recal done on
cached_dev attached.

Update the cached_dev_sectors before bcache_device_detach called
as bcache_device_detach will set bcache_device->c to NULL.

Signed-off-by: Shenghui Wang <shhuiw@foxmail.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-05-15 15:12:23 +09:00
Kent Overstreet
17b799c3ab dm: Use kzalloc for all structs with embedded biosets/mempools
[ Upstream commit d377535405 ]

mempool_init()/bioset_init() require that the mempools/biosets be zeroed
first; they probably should not _require_ this, but not allocating those
structs with kzalloc is a fairly nonsensical thing to do (calling
mempool_exit()/bioset_exit() on an uninitialized mempool/bioset is legal
and safe, but only works if said memory was zeroed.)

Acked-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-05-15 14:54:47 +09:00
Mikulas Patocka
70e47aca0d dm snapshot: rework COW throttling to fix deadlock
[ Upstream commit b21555786f ]

Commit 721b1d98fb ("dm snapshot: Fix excessive memory usage and
workqueue stalls") introduced a semaphore to limit the maximum number of
in-flight kcopyd (COW) jobs.

The implementation of this throttling mechanism is prone to a deadlock:

1. One or more threads write to the origin device causing COW, which is
   performed by kcopyd.

2. At some point some of these threads might reach the s->cow_count
   semaphore limit and block in down(&s->cow_count), holding a read lock
   on _origins_lock.

3. Someone tries to acquire a write lock on _origins_lock, e.g.,
   snapshot_ctr(), which blocks because the threads at step (2) already
   hold a read lock on it.

4. A COW operation completes and kcopyd runs dm-snapshot's completion
   callback, which ends up calling pending_complete().
   pending_complete() tries to resubmit any deferred origin bios. This
   requires acquiring a read lock on _origins_lock, which blocks.

   This happens because the read-write semaphore implementation gives
   priority to writers, meaning that as soon as a writer tries to enter
   the critical section, no readers will be allowed in, until all
   writers have completed their work.

   So, pending_complete() waits for the writer at step (3) to acquire
   and release the lock. This writer waits for the readers at step (2)
   to release the read lock and those readers wait for
   pending_complete() (the kcopyd thread) to signal the s->cow_count
   semaphore: DEADLOCK.

The above was thoroughly analyzed and documented by Nikos Tsironis as
part of his initial proposal for fixing this deadlock, see:
https://www.redhat.com/archives/dm-devel/2019-October/msg00001.html

Fix this deadlock by reworking COW throttling so that it waits without
holding any locks. Add a variable 'in_progress' that counts how many
kcopyd jobs are running. A function wait_for_in_progress() will sleep if
'in_progress' is over the limit. It drops _origins_lock in order to
avoid the deadlock.

Reported-by: Guruswamy Basavaiah <guru2018@gmail.com>
Reported-by: Nikos Tsironis <ntsironis@arrikto.com>
Reviewed-by: Nikos Tsironis <ntsironis@arrikto.com>
Tested-by: Nikos Tsironis <ntsironis@arrikto.com>
Fixes: 721b1d98fb ("dm snapshot: Fix excessive memory usage and workqueue stalls")
Cc: stable@vger.kernel.org # v5.0+
Depends-on: 4a3f111a73a8c ("dm snapshot: introduce account_start_copy() and account_end_copy()")
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-05-15 14:54:46 +09:00
Mikulas Patocka
bba7c85881 dm snapshot: introduce account_start_copy() and account_end_copy()
[ Upstream commit a2f83e8b0c ]

This simple refactoring moves code for modifying the semaphore cow_count
into separate functions to prepare for changes that will extend these
methods to provide for a more sophisticated mechanism for COW
throttling.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Reviewed-by: Nikos Tsironis <ntsironis@arrikto.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-05-15 14:54:44 +09:00
Mikulas Patocka
d9e0dcbe04 dm snapshot: use mutex instead of rw_semaphore
[ Upstream commit ae1093be5a ]

The rw_semaphore is acquired for read only in two places, neither is
performance-critical.  So replace it with a mutex -- which is more
efficient.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-05-15 14:54:42 +09:00
Xiao Ni
243b1eff4d md/raid6: Set R5_ReadError when there is read failure on parity disk
commit 143f6e733b upstream.

7471fb77ce ("md/raid6: Fix anomily when recovering a single device in
RAID6.") avoids rereading P when it can be computed from other members.
However, this misses the chance to re-write the right data to P. This
patch sets R5_ReadError if the re-read fails.

Also, when re-read is skipped, we also missed the chance to reset
rdev->read_errors to 0. It can fail the disk when there are many read
errors on P member disk (other disks don't have read error)

V2: upper layer read request don't read parity/Q data. So there is no
need to consider such situation.

This is Reported-by: kbuild test robot <lkp@intel.com>

Fixes: 7471fb77ce ("md/raid6: Fix anomily when recovering a single device in RAID6.")
Cc: <stable@vger.kernel.org> #4.4+
Signed-off-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-05-15 14:34:56 +09:00
Yufen Yu
2885bb608c md/raid1: fail run raid1 array when active disk less than one
[ Upstream commit 07f1a6850c ]

When run test case:
  mdadm -CR /dev/md1 -l 1 -n 4 /dev/sd[a-d] --assume-clean --bitmap=internal
  mdadm -S /dev/md1
  mdadm -A /dev/md1 /dev/sd[b-c] --run --force

  mdadm --zero /dev/sda
  mdadm /dev/md1 -a /dev/sda

  echo offline > /sys/block/sdc/device/state
  echo offline > /sys/block/sdb/device/state
  sleep 5
  mdadm -S /dev/md1

  echo running > /sys/block/sdb/device/state
  echo running > /sys/block/sdc/device/state
  mdadm -A /dev/md1 /dev/sd[a-c] --run --force

mdadm run fail with kernel message as follow:
[  172.986064] md: kicking non-fresh sdb from array!
[  173.004210] md: kicking non-fresh sdc from array!
[  173.022383] md/raid1:md1: active with 0 out of 4 mirrors
[  173.022406] md1: failed to create bitmap (-5)

In fact, when active disk in raid1 array less than one, we
need to return fail in raid1_run().

Reviewed-by: NeilBrown <neilb@suse.de>
Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-05-15 14:34:01 +09:00
Guoqing Jiang
d37bf075b3 md: don't set In_sync if array is frozen
[ Upstream commit 062f5b2ae1 ]

When a disk is added to array, the following path is called in mdadm.

Manage_subdevs -> sysfs_freeze_array
               -> Manage_add
               -> sysfs_set_str(&info, NULL, "sync_action","idle")

Then from kernel side, Manage_add invokes the path (add_new_disk ->
validate_super = super_1_validate) to set In_sync flag.

Since In_sync means "device is in_sync with rest of array", and the new
added disk need to resync thread to help the synchronization of data.
And md_reap_sync_thread would call spare_active to set In_sync for the
new added disk finally. So don't set In_sync if array is in frozen.

Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-05-15 14:33:19 +09:00
Guoqing Jiang
5fcff55676 md: don't call spare_active in md_reap_sync_thread if all member devices can't work
[ Upstream commit 0d8ed0e9bf ]

When add one disk to array, the md_reap_sync_thread is responsible
to activate the spare and set In_sync flag for the new member in
spare_active().

But if raid1 has one member disk A, and disk B is added to the array.
Then we offline A before all the datas are synchronized from A to B,
obviously B doesn't have the latest data as A, but B is still marked
with In_sync flag.

So let's not call spare_active under the condition, otherwise B is
still showed with 'U' state which is not correct.

Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-05-15 14:33:18 +09:00
Mikulas Patocka
d147e63cd7 dm table: fix invalid memory accesses with too high sector number
commit 1cfd5d3399 upstream.

If the sector number is too high, dm_table_find_target() should return a
pointer to a zeroed dm_target structure (the caller should test it with
dm_target_is_valid).

However, for some table sizes, the code in dm_table_find_target() that
performs btree lookup will access out of bound memory structures.

Fix this bug by testing the sector number at the beginning of
dm_table_find_target(). Also, add an "inline" keyword to the function
dm_table_get_size() because this is a hot path.

Fixes: 512875bd96 ("dm: table detect io beyond device")
Cc: stable@vger.kernel.org
Reported-by: Zhang Tao <kontais@zoho.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-05-15 14:21:16 +09:00