Commit Graph

1185645 Commits

Author SHA1 Message Date
Yu Kuai
010444623e md/raid10: prevent soft lockup while flush writes
Currently, there is no limit for raid1/raid10 plugged bio. While flushing
writes, raid1 has cond_resched() while raid10 doesn't, and too many
writes can cause soft lockup.

Follow up soft lockup can be triggered easily with writeback test for
raid10 with ramdisks:

watchdog: BUG: soft lockup - CPU#10 stuck for 27s! [md0_raid10:1293]
Call Trace:
 <TASK>
 call_rcu+0x16/0x20
 put_object+0x41/0x80
 __delete_object+0x50/0x90
 delete_object_full+0x2b/0x40
 kmemleak_free+0x46/0xa0
 slab_free_freelist_hook.constprop.0+0xed/0x1a0
 kmem_cache_free+0xfd/0x300
 mempool_free_slab+0x1f/0x30
 mempool_free+0x3a/0x100
 bio_free+0x59/0x80
 bio_put+0xcf/0x2c0
 free_r10bio+0xbf/0xf0
 raid_end_bio_io+0x78/0xb0
 one_write_done+0x8a/0xa0
 raid10_end_write_request+0x1b4/0x430
 bio_endio+0x175/0x320
 brd_submit_bio+0x3b9/0x9b7 [brd]
 __submit_bio+0x69/0xe0
 submit_bio_noacct_nocheck+0x1e6/0x5a0
 submit_bio_noacct+0x38c/0x7e0
 flush_pending_writes+0xf0/0x240
 raid10d+0xac/0x1ed0

Fix the problem by adding cond_resched() to raid10 like what raid1 did.

Note that unlimited plugged bio still need to be optimized, for example,
in the case of lots of dirty pages writeback, this will take lots of
memory and io will spend a long time in plug, hence io latency is bad.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230529131106.2123367-2-yukuai1@huaweicloud.com
2023-06-13 15:25:43 -07:00
Li Nan
2ae6aaf769 md/raid10: fix io loss while replacement replace rdev
When removing a disk with replacement, the replacement will be used to
replace rdev. During this process, there is a brief window in which both
rdev and replacement are read as NULL in raid10_write_request(). This
will result in io not being submitted but it should be.

  //remove				//write
  raid10_remove_disk			raid10_write_request
   mirror->rdev = NULL
					 read rdev -> NULL
   mirror->rdev = mirror->replacement
   mirror->replacement = NULL
					 read replacement -> NULL

Fix it by reading replacement first and rdev later, meanwhile, use smp_mb()
to prevent memory reordering.

Fixes: 475b0321a4 ("md/raid10: writes should get directed to replacement as well as original.")
Signed-off-by: Li Nan <linan122@huawei.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230602091839.743798-3-linan666@huaweicloud.com
2023-06-13 15:25:43 -07:00
Li Nan
8d355a46c1 md/raid10: Do not add spare disk when recovery fails
In raid10_sync_request(), if data cannot be read from any disk for
recovery, it will go to 'giveup' and let 'chunks_skipped' + 1. After
multiple 'giveup', when 'chunks_skipped >= geo.raid_disks', it will
return 'max_sector', indicating that the recovery has been completed.
However, the recovery is just aborted and the data remains inconsistent.

Fix it by setting mirror->recovery_disabled, which will prevent the spare
disk from being added to this mirror. The same issue also exists during
resync, it will be fixed afterwards.

Signed-off-by: Li Nan <linan122@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230602091839.743798-2-linan666@huaweicloud.com
2023-06-13 15:25:43 -07:00
Li Nan
4d8a5754a6 md/raid10: clean up md_add_new_disk()
Commit 1a855a0606 ("md: fix bug with re-adding of partially recovered
device.") only add device which is set to In_sync. But it let devices
without metadata cannot be added when they should be.

Commit bf572541ab ("md: fix regression with re-adding devices to arrays
with no metadata") fix the above issue, it set device without metadata to
In_sync when add new disk.

However, after commit f466722ca6 ("md: Change handling of save_raid_disk
and metadata update during recovery.") deletes changes of the first patch,
setting In_sync for devcie without metadata is meanless because the flag
will be cleared soon and will not be used during this period. Clean it up.

Signed-off-by: Li Nan <linan122@huawei.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230527101851.3266500-2-linan666@huaweicloud.com
2023-06-13 15:25:42 -07:00
Li Nan
6090368abc md/raid10: prioritize adding disk to 'removed' mirror
When add a new disk to raid10, it will traverse conf->mirror from start
and find one of the following mirror to add:
  1. mirror->rdev is set to WantReplacement and it have no replacement,
     set new disk to mirror->replacement.
  2. no mirror->rdev, set new disk to mirror->rdev.

There is a array as below (sda is set to WantReplacement):

    Number   Major   Minor   RaidDevice State
       0       8        0        0      active sync set-A   /dev/sda
       -       0        0        1      removed
       2       8       32        2      active sync set-A   /dev/sdc
       3       8       48        3      active sync set-B   /dev/sdd

Use 'mdadm --add' to add a new disk to this array, the new disk will
become sda's replacement instead of add to removed position, which is
confusing for users. Meanwhile, after new disk recovery success, sda
will be set to Faulty.

Prioritize adding disk to 'removed' mirror is a better choice. In the
above scenario, the behavior is the same as before, except sda will not
be deleted. Before other disks are added, continued use sda is more
reliable.

Signed-off-by: Li Nan <linan122@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230527092007.3008856-1-linan666@huaweicloud.com
2023-06-13 15:25:42 -07:00
Li Nan
59f8f0b54c md/raid10: improve code of mrdev in raid10_sync_request
'need_recover' and 'mrdev' are equivalent in raid10_sync_request(), and
inc mrdev->nr_pending is unreasonable if don't need recovery. Replace
'need_recover' with 'mrdev', and only inc nr_pending when needed.

Signed-off-by: Li Nan <linan122@huawei.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230527072218.2365857-3-linan666@huaweicloud.com
2023-06-13 15:25:42 -07:00
Li Nan
34817a2441 md/raid10: fix null-ptr-deref of mreplace in raid10_sync_request
There are two check of 'mreplace' in raid10_sync_request(). In the first
check, 'need_replace' will be set and 'mreplace' will be used later if
no-Faulty 'mreplace' exists, In the second check, 'mreplace' will be
set to NULL if it is Faulty, but 'need_replace' will not be changed
accordingly. null-ptr-deref occurs if Faulty is set between two check.

Fix it by merging two checks into one. And replace 'need_replace' with
'mreplace' because their values are always the same.

Fixes: ee37d7314a ("md/raid10: Fix raid10 replace hang when new added disk faulty")
Signed-off-by: Li Nan <linan122@huawei.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230527072218.2365857-2-linan666@huaweicloud.com
2023-06-13 15:25:42 -07:00
Yu Kuai
75aa7a1b8f md/raid5: don't start reshape when recovery or replace is in progress
When recovery is interrupted (reboot, etc.) check for MD_RECOVERY_RUNNING
is not enough to tell recovery is in progress. Also check recovery_cp
before starting reshape.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230529133410.2125914-1-yukuai1@huaweicloud.com
2023-06-13 15:25:41 -07:00
Yu Kuai
4469315439 md: protect md_thread with rcu
Currently, there are many places that md_thread can be accessed without
protection, following are known scenarios that can cause
null-ptr-dereference or uaf:

1) sync_thread that is allocated and started from md_start_sync()
2) mddev->thread can be accessed directly from timeout_store() and
   md_bitmap_daemon_work()
3) md_unregister_thread() from action_store().

Currently, a global spinlock 'pers_lock' is borrowed to protect
'mddev->thread' in some places, this problem can be fixed likewise,
however, use a global lock for all the cases is not good.

Fix this problem by protecting all md_thread with rcu.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230523021017.3048783-6-yukuai1@huaweicloud.com
2023-06-13 15:25:39 -07:00
Yu Kuai
4eeb6535cd md/bitmap: factor out a helper to set timeout
Register/unregister 'mddev->thread' are both under 'reconfig_mutex',
however, some context didn't hold the mutex to access mddev->thread,
which can cause null-ptr-deference:

1) md_bitmap_daemon_work() can be called from md_check_recovery() where
'reconfig_mutex' is not held, deference 'mddev->thread' might cause
null-ptr-deference, because md_unregister_thread() reset the pointer
before stopping the thread.

2) timeout_store() access 'mddev->thread' multiple times,
null-ptr-deference can be triggered if 'mddev->thread' is reset in the
middle.

This patch factor out a helper to set timeout, the new helper always
check if 'mddev->thread' is null first, so that problem 1 can be fixed.

Now that this helper only access 'mddev->thread' once, but it's possible
that 'mddev->thread' can be freed while this helper is still in progress,
hence the problem is not fixed yet. Follow up patches will fix this by
protecting md_thread with rcu.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230523021017.3048783-5-yukuai1@huaweicloud.com
2023-06-13 15:25:13 -07:00
Yu Kuai
c333673a78 md/bitmap: always wake up md_thread in timeout_store
md_wakeup_thread() can handle the case that pass in md_thread is NULL,
the only difference is that md_wakeup_thread() will be called when
current timeout is 'MAX_SCHEDULE_TIMEOUT', this should not matter
because timeout_store() is not hot path, and the daemon process is
woke up more than demand from other context already.

Prepare to factor out a helper to set timeout.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230523021017.3048783-4-yukuai1@huaweicloud.com
2023-06-13 15:25:13 -07:00
Yu Kuai
955a257d69 dm-raid: remove useless checking in raid_message()
md_wakeup_thread() handle the case that pass in md_thread is NULL, there
is no need to check this.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230523021017.3048783-3-yukuai1@huaweicloud.com
2023-06-13 15:25:13 -07:00
Yu Kuai
e5e9b9cb71 md: factor out a helper to wake up md_thread directly
md_wakeup_thread() can't wakeup md_thread->tsk if md_thread->run is
still in progress, and in some cases md_thread->tsk need to be woke up
directly, like md_set_readonly() and do_md_stop().

Commit 9dfbdafda3 ("md: unlock mddev before reap sync_thread in
action_store") introduce a new scenario where unregister sync_thread is
not protected by 'reconfig_mutex', this can cause null-ptr-deference in
theroy:

t1: md_set_readonly		t2: action_store
				md_unregister_thread
				// 'reconfig_mutex' is not held
// 'reconfig_mutex' is held by caller
if (mddev->sync_thread)
				 thread = *threadp
				 *threadp = NULL
 wake_up_process(mddev->sync_thread->tsk)
 // null-ptr-deference

Fix this problem by factoring out a helper to wake up md_thread directly,
so that 'sync_thread' won't be accessed multiple times from the reader
side. This helper also prepare to protect md_thread with rcu.

Noted that later patches is going to fix that unregister sync_thread is
not protected by 'reconfig_mutex' from action_store().

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230523021017.3048783-2-yukuai1@huaweicloud.com
2023-06-13 15:25:11 -07:00
Yu Kuai
3ce94ce5d0 md: fix duplicate filename for rdev
Commit 5792a2856a ("[PATCH] md: avoid a deadlock when removing a device
from an md array via sysfs") delays the deletion of rdev, however, this
introduces a window that rdev can be added again while the deletion is
not done yet, and sysfs will complain about duplicate filename.

Follow up patches try to fix this problem by flushing workqueue, however,
flush_rdev_wq() is just dead code, the progress in
md_kick_rdev_from_array():

1) list_del_rcu(&rdev->same_set);
2) synchronize_rcu();
3) queue_work(md_rdev_misc_wq, &rdev->del_work);

So in flush_rdev_wq(), if rdev is found in the list, work_pending() can
never pass, in the meantime, if work is queued, then rdev can never be
found in the list.

flush_rdev_wq() can be replaced by flush_workqueue() directly, however,
this approach is not good:
- the workqueue is global, this synchronization for all raid disks is
  not necessary.
- flush_workqueue can't be called under 'reconfig_mutex', there is still
  a small window between flush_workqueue() and mddev_lock() that other
  contexts can queue new work, hence the problem is not solved completely.

sysfs already has apis to support delete itself through writer, and
these apis, specifically sysfs_break/unbreak_active_protection(), is used
to support deleting rdev synchronously. Therefore, the above commit can be
reverted, and sysfs duplicate filename can be avoided.

A new mdadm regression test is proposed as well([1]).

[1] https://lore.kernel.org/linux-raid/20230428062845.1975462-1-yukuai1@huaweicloud.com/

Fixes: 5792a2856a ("[PATCH] md: avoid a deadlock when removing a device from an md array via sysfs")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230523012727.3042247-1-yukuai1@huaweicloud.com
2023-06-13 15:24:14 -07:00
Li Nan
f8b20a4054 md/raid10: fix wrong setting of max_corr_read_errors
There is no input check when echo md/max_read_errors and overflow might
occur. Add check of input number.

Fixes: 1e50915fe0 ("raid: improve MD/raid10 handling of correctable read errors.")
Signed-off-by: Li Nan <linan122@huawei.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230522072535.1523740-3-linan666@huaweicloud.com
2023-06-13 15:13:22 -07:00
Li Nan
6beb489b2e md/raid10: fix overflow of md/safe_mode_delay
There is no input check when echo md/safe_mode_delay in safe_delay_store().
And msec might also overflow when HZ < 1000 in safe_delay_show(), Fix it by
checking overflow in safe_delay_store() and use unsigned long conversion in
safe_delay_show().

Fixes: 72e02075a3 ("md: factor out parsing of fixed-point numbers")
Signed-off-by: Li Nan <linan122@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230522072535.1523740-2-linan666@huaweicloud.com
2023-06-13 15:13:21 -07:00
Yu Kuai
868bba54a3 md/raid5: fix a deadlock in the case that reshape is interrupted
If reshape is in progress and io across reshape_position is issued, such
io will wait for reshape to make progress(see details in the case that
make_stripe_request() return STRIPE_SCHEDULE_AND_RETRY).

It has been reported several times that if system reboot while growing
raid5 to raid6, array assemble will hang infinitely([1, 2]). This is
because following deadlock is triggered:

1) a normal io is waiting for reshape to progress, this io can be from
   system-udevd or mdadm.
2) while assemble, mdadm tries to suspend the array, hence
   'reconfig_mutex' is held and mddev_suspend() must wait for normal io
   to be done.
3) daemon thread can't start reshape because 'reconfig_mutex' can't be
   held.

1) and 3) is unbreakable because they're foundation design. In order to
break 2), following is possible solutions that I can think of:

a) Let mddev_suspend() fail is not a good option, because this will
   break many scenarios since mddev_suspend() doesn't fail before.
b) Fail the io that is waiting for reshape to make progress from
   mddev_suspend().
c) Return false for the io that is waiting for reshape to make
   progress from raid5_make_request(), and these io will wait for
   suspend to be done in md_handle_request(), where 'active_io' is
   not grabbed.

c) sounds better than b), however, b) is used because it's easy and
straightforward, and it's verified that mdadm can assemble in this case.
On the other hand, c) breaks the logic that mddev_suspend() will wait
for submitted io to be completely handled.

Fix the problem by checking reshape in mddev_suspend(), if reshape can't
make progress and there are still some io waiting for reshape, fail
those io.

[1] https://lore.kernel.org/all/CAFig2csUV2QiomUhj_t3dPOgV300dbQ6XtM9ygKPdXJFSH__Nw@mail.gmail.com/
[2] https://lore.kernel.org/all/CAO2ABipzbw6QL5eNa44CQHjiVa-LTvS696Mh9QaTw+qsUKFUCw@mail.gmail.com/

Reported-by: Jove <jovetoo@gmail.com>
Reported-by: David Gilmour <dgilmour76@gmail.com>
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230512015610.821290-6-yukuai1@huaweicloud.com
2023-06-13 15:13:21 -07:00
Yu Kuai
3e00777d51 md: add a new api prepare_suspend() in md_personality
There are no functional changes, the new api will be used later to do
special handling for raid456 in md_suspend().

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230512015610.821290-5-yukuai1@huaweicloud.com
2023-06-13 15:13:21 -07:00
Yu Kuai
431e61257d md: export md_is_rdwr() and is_md_suspended()
The two apis will be used later to fix a deadlock in raid456, there are
no functional changes.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230512015610.821290-4-yukuai1@huaweicloud.com
2023-06-13 15:13:21 -07:00
Yu Kuai
873f50ece4 md: fix data corruption for raid456 when reshape restart while grow up
Currently, if reshape is interrupted, echo "reshape" to sync_action will
restart reshape from scratch, for example:

echo frozen > sync_action
echo reshape > sync_action

This will corrupt data before reshape_position if the array is growing,
fix the problem by continue reshape from reshape_position.

Reported-by: Peter Neuwirth <reddunur@online.de>
Link: https://lore.kernel.org/linux-raid/e2f96772-bfbc-f43b-6da1-f520e5164536@online.de/
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230512015610.821290-3-yukuai1@huaweicloud.com
2023-06-13 15:13:20 -07:00
Yu Kuai
46038b30b3 md/raid5: don't allow replacement while reshape is in progress
If reshape is interrupted(for example, echo frozen to sync_action), then
rdev replacement can be set. It's safe because reshape is always prior to
resync in md_check_recovery(). However, if system reboots, then kernel will
complain cannot handle concurrent replacement and reshape and this array
is not able to assemble anymore.

Fix this problem by don't allow replacement until reshape is done.

Reported-by: Peter Neuwirth <reddunur@online.de>
Link: https://lore.kernel.org/linux-raid/e2f96772-bfbc-f43b-6da1-f520e5164536@online.de/
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230512015610.821290-2-yukuai1@huaweicloud.com
2023-06-13 15:13:20 -07:00
Li Nan
301867b1c1 md/raid10: check slab-out-of-bounds in md_bitmap_get_counter
If we write a large number to md/bitmap_set_bits, md_bitmap_checkpage()
will return -EINVAL because 'page >= bitmap->pages', but the return value
was not checked immediately in md_bitmap_get_counter() in order to set
*blocks value and slab-out-of-bounds occurs.

Move check of 'page >= bitmap->pages' to md_bitmap_get_counter() and
return directly if true.

Fixes: ef42567335 ("md/bitmap: optimise scanning of empty bitmaps.")
Signed-off-by: Li Nan <linan122@huawei.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230515134808.3936750-2-linan666@huaweicloud.com
2023-06-13 15:13:20 -07:00
Arnd Bergmann
3de13550a2 raid6: neon: add missing prototypes
The raid6 syndrome functions are generated for different sizes and have
no generic prototype, while in the inner functions have a prototype
in a header that cannot be included from the correct file. In both
cases, the compiler warns about missing prototypes:

lib/raid6/recov_neon_inner.c:27:6: warning: no previous prototype for '__raid6_2data_recov_neon' [-Wmissing-prototypes]
lib/raid6/recov_neon_inner.c:77:6: warning: no previous prototype for '__raid6_datap_recov_neon' [-Wmissing-prototypes]
lib/raid6/neon1.c:56:6: warning: no previous prototype for 'raid6_neon1_gen_syndrome_real' [-Wmissing-prototypes]
lib/raid6/neon1.c:86:6: warning: no previous prototype for 'raid6_neon1_xor_syndrome_real' [-Wmissing-prototypes]
lib/raid6/neon2.c:56:6: warning: no previous prototype for 'raid6_neon2_gen_syndrome_real' [-Wmissing-prototypes]
lib/raid6/neon2.c:97:6: warning: no previous prototype for 'raid6_neon2_xor_syndrome_real' [-Wmissing-prototypes]
lib/raid6/neon4.c:56:6: warning: no previous prototype for 'raid6_neon4_gen_syndrome_real' [-Wmissing-prototypes]
lib/raid6/neon4.c:119:6: warning: no previous prototype for 'raid6_neon4_xor_syndrome_real' [-Wmissing-prototypes]
lib/raid6/neon8.c:56:6: warning: no previous prototype for 'raid6_neon8_gen_syndrome_real' [-Wmissing-prototypes]
lib/raid6/neon8.c:163:6: warning: no previous prototype for 'raid6_neon8_xor_syndrome_real' [-Wmissing-prototypes]

Add a new header file that contains the prototypes for both to avoid
the warnings.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230517132220.937200-1-arnd@kernel.org
2023-06-13 15:13:20 -07:00
Christoph Hellwig
3dbd53c7be swim3: fix the floppy_locked_ioctl prototype
Add back the accidentally dropped mode parameter.

Fixes: b60f7635788a ("swim3: fix the floppy_locked_ioctl prototype")
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230613154309.327557-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-13 09:45:40 -06:00
Keith Busch
c917dd96fe nvme: skip optional id ctrl csi if it failed
A frequently recieved report is the driver requests the optional Command
Set Specific Identify Controller structure. Some controllers report this
in their error log, which tiggers other warnings to user space
monitoring the devices.

These error entries are harmless and of questionable value to save in
the log, but let's reduce their occurance by not resending the command
if it previously failed. This will not prevent the errors on the initial
module load, but will greatly reduce their occurance on any rescans and
resumes from suspend.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=217445
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-06-12 18:24:15 -07:00
Irvin Cote
35e797b024 nvme-core: use nvme_ns_head_multipath instead of ns->head->disk
Change the way we check for a multipath nshead so as
to consistently use the same check to assert the same condition.

Signed-off-by: Irvin Cote <irvincoteg@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-06-12 10:44:57 -07:00
Daniel Wagner
d97b4111b3 nvmet-fcloop: Do not wait on completion when unregister fails
The nvme_fc_unregister_localport() returns an error code in case that
the locaport pointer is NULL or has already been unegisterd. localport is
is either in the ONLINE state (all resources allocated) or has already
been put into DELETED state.

In this case we will never receive an wakeup call and thus any caller
will hang, e.g. module unload.

Signed-off-by: Daniel Wagner <dwagner@suse.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-06-12 10:37:00 -07:00
Chaitanya Kulkarni
959ffef13b nvme-fabrics: open code __nvmf_host_find()
There is no point in maintaining a separate funciton __nvmf_host_find()
that has only one caller nvmf_host_add() especially when caller and
callee both are small enough to merge.

Due to this we are actually repeating the error handling code in both
callee and caller for no reason that can be avoided, but instead we have
to read both function to establish the correctness along with additional
lockdep warning check due to involved locking.

Just open code __nvmf_host_find() in nvme_host_alloc() with appropriate
comment that removes repeated error checks in the callee/caller and
lockdep check that is needed for the nvmf_hosts_mutex involvement,
diffstats :-

 drivers/nvme/host/fabrics.c | 75 +++++++++++++------------------------
 1 file changed, 27 insertions(+), 48 deletions(-)

Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-06-12 10:37:00 -07:00
Chaitanya Kulkarni
900095bfbb nvme-fabrics: error out to unlock the mutex
Currently, in the nvmf_host_add() function, if the nvmf_host_alloc()
call failed to allocate memory for the host, the code would directly
return -ENOMEM without unlocking the nvmf_hosts_mutex. This could
lead to potential issues with mutex synchronization.

Fix that error handling mechanism by jumping to the out_unlock label
when nvmf_host_alloc() fails. This ensures that the mutex is unlocked
before returning the error code. The updated code enhances avoids
possible deadlocks.

Fixes: f0cebf82004d ("nvme-fabrics: prevent overriding of existing host")
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Julia Lawall <julia.lawall@inria.fr>
Closes: https://lore.kernel.org/r/202306020909.MTUEBeIa-lkp@intel.com/
Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Julia Lawall <julia.lawall@inria.fr>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-06-12 10:36:59 -07:00
Daniel Gomez
bdbfcd5f6c nvme: Increase block size variable size to 32-bit
Increase block size variable size to 32-bit unsigned to be able to
support block devices larger than 32k (starting from 64 KiB).

Physical and logical block size already support unsigned 32-bit.

Signed-off-by: Daniel Gomez <da.gomez@samsung.com>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-06-12 10:36:59 -07:00
Chaitanya Kulkarni
e9227d486e nvme-fcloop: no need to return from void function
Remove return at the end of void function.

Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-06-12 10:36:59 -07:00
Chaitanya Kulkarni
94c78ea124 nvmet-auth: remove unnecessary break after goto
Remove dead break after goto.

Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-06-12 10:36:59 -07:00
Christophe JAILLET
2ad0713c73 nvmet-auth: remove some dead code
'status' is known to be 0 at the point.
And nvmet_auth_challenge() return a -E<ERROR_CODE> or 0.
So these lines of code should just be removed.

Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-06-12 10:36:59 -07:00
Irvin Cote
2110a6bcd7 nvme-core: remove redundant check from nvme_init_ns_head
nvme_find_ns_head already checks that the list of namescpaces
in an already existing namespace head is not empty

Signed-off-by: Irvin Cote <irvincoteg@gmail.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-06-12 10:36:59 -07:00
Max Gurtovoy
942e21c042 nvme: move sysfs code to a dedicated sysfs.c file
The core.c file became long and hard to maintain. Create a dedicated
file to centralize the sysfs functionality. This is a common practice to
separate sysfs/configfs related logic from the main driver logic .c file.
For example, in the nvmet module the configfs interface has its own
dedicated file.

This patch does not include any functional changes.

Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com>
[merged dhchap memleak fixes, include nvme-auth.h]
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-06-12 10:36:59 -07:00
Max Gurtovoy
ae8bd606e0 nvme-fabrics: prevent overriding of existing host
When first connecting a target using the "default" host parameters,
setting the hostid from the command line during a subsequent connection
establishment would override the "default" hostid parameter. This would
cause an existing connection that is already using the host definitions
to lose its hostid.

To address this issue, the code has been modified to allow only 1:1
mapping between hostnqn and hostid. This will maintain unambiguous host
identification. Any non 1:1 mapping will be rejected during connection
establishment.

Tested-by: Noam Gottlieb <ngottlieb@nvidia.com>
Reviewed-by: Israel Rukshin <israelr@nvidia.com>
Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-06-12 10:36:58 -07:00
Max Gurtovoy
5e4b55fa52 nvme-fabrics: check hostid using uuid_equal
Use a dedicated function to match uuids instead of duplicating it.

Tested-by: Noam Gottlieb <ngottlieb@nvidia.com>
Reviewed-by: Israel Rukshin <israelr@nvidia.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-06-12 10:36:58 -07:00
Max Gurtovoy
b86d6595f7 nvme-fabrics: unify common code in admin and io queue connect
To simplify code maintenance, it is recommended to avoid duplicating
code.

Tested-by: Noam Gottlieb <ngottlieb@nvidia.com>
Reviewed-by: Israel Rukshin <israelr@nvidia.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-06-12 10:36:58 -07:00
Christophe JAILLET
92bbe55182 nvmet: reorder fields in 'struct nvmefc_fcp_req'
Group some variables based on their sizes to reduce holes.
On x86_64, this shrinks the size of 'struct nvmefc_fcp_req' from
112 to 104 bytes.

This structure is embedded in some other structures (nvme_fc_fcp_op
which itself is embedded in nvme_fcp_op_w_sgl), so it helps reducing the
size of these structures too.

Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-06-12 10:36:55 -07:00
Christophe JAILLET
0f5335e158 nvmet: reorder fields in 'struct nvme_dhchap_queue_context'
Group some variables based on their sizes to reduce holes.
On x86_64, this shrinks the size of 'struct nvme_dhchap_queue_context' from
416 to 400 bytes.

This structure is kvcalloc()'ed in nvme_auth_init_ctrl(), so it is likely
that the allocation can be relatively big. Saving 16 bytes per structure
may might a slight difference.

Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-06-12 10:36:51 -07:00
Christophe JAILLET
e64b0c807c nvmet: reorder fields in 'struct nvmf_ctrl_options'
Group some variables based on their sizes to reduce holes.
On x86_64, this shrinks the size of 'struct nvmf_ctrl_options' from 136 to
128 bytes.

When such a structure is allocated in nvmf_create_ctrl(), because of the
way memory allocation works, when 136 bytes were requested, 192 bytes were
allocated.

So this saves 64 bytes per allocation, 1 cache line to hold the whole
structure and a few cycles when zeroing the memory in nvmf_create_ctrl().

Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-06-12 10:36:47 -07:00
Christophe JAILLET
9d217fb0e7 nvme: reorder fields in 'struct nvme_ctrl'
Group some variables based on their sizes to reduce holes.
On x86_64, this shrinks the size of 'struct nvme_ctrl' from 5368 to 5344
bytes when all CONFIG_* are defined.

This structure is embedded into some other structures, so it helps reducing
their size as well.

Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-06-12 10:36:42 -07:00
Christophe JAILLET
c60651e32f nvmet: reorder fields in 'struct nvmet_sq'
Group some variables based on their sizes to reduce holes.
On x86_64, this shrinks the size of 'struct nvmet_sq' from 472 to 464
bytes when CONFIG_NVME_TARGET_AUTH is defined.

This structure is embedded into some other structures, so it helps reducing
their sizes as well.

Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-06-12 10:36:37 -07:00
Keith Busch
a249d3066d nvme-fabrics: add queue setup helpers
tcp and rdma transports have lots of duplicate code setting up the
different queue mappings. Add common helpers.

Cc: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-06-12 10:33:03 -07:00
Irvin Cote
4a4d9bc0c8 nvme-pci: cleaning up nvme_pci_init_request
Erase the superfluous line that retrieves the nvme_dev.

Signed-off-by: Irvin Cote <irvincoteg@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-06-12 10:33:03 -07:00
Max Gurtovoy
f3f2837315 nvme-rdma: fix typo in comment
There is no ib_stop_cq API and the need for the +1 is for ib_drain_qp.

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Israel Rukshin <israelr@nvidia.com>
Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-06-12 10:33:02 -07:00
Chaitanya Kulkarni
7ed5cf8e6d nvme-core: fix dev_pm_qos memleak
Call dev_pm_qos_hide_latency_tolerance() in the error unwind patch to
avoid following kmemleak:-

blktests (master) # kmemleak-clear; ./check nvme/044;
blktests (master) # kmemleak-scan ; kmemleak-show
nvme/044 (Test bi-directional authentication)                [passed]
    runtime  2.111s  ...  2.124s
unreferenced object 0xffff888110c46240 (size 96):
  comm "nvme", pid 33461, jiffies 4345365353 (age 75.586s)
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  backtrace:
    [<0000000069ac2cec>] kmalloc_trace+0x25/0x90
    [<000000006acc66d5>] dev_pm_qos_update_user_latency_tolerance+0x6f/0x100
    [<00000000cc376ea7>] nvme_init_ctrl+0x38e/0x410 [nvme_core]
    [<000000007df61b4b>] 0xffffffffc05e88b3
    [<00000000d152b985>] 0xffffffffc05744cb
    [<00000000f04a4041>] vfs_write+0xc5/0x3c0
    [<00000000f9491baf>] ksys_write+0x5f/0xe0
    [<000000001c46513d>] do_syscall_64+0x3b/0x90
    [<00000000ecf348fe>] entry_SYSCALL_64_after_hwframe+0x72/0xdc

Link: https://lore.kernel.org/linux-nvme/CAHj4cs-nDaKzMx2txO4dbE+Mz9ePwLtU0e3egz+StmzOUgWUrA@mail.gmail.com/
Fixes: f50fff73d6 ("nvme: implement In-Band authentication")
Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Tested-by: Yi Zhang <yi.zhang@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-06-12 10:33:02 -07:00
Chaitanya Kulkarni
3a12a0b868 nvme-core: add missing fault-injection cleanup
Add missing fault-injection cleanup in nvme_init_ctrl() in the error
unwind path that also fixes following message for blktests:-

linux-block (for-next) # grep debugfs debugfs-err.log
[  147.853464] debugfs: Directory 'nvme1' with parent '/' already present!
[  147.853973] nvme1: failed to create debugfs attr
[  148.802490] debugfs: Directory 'nvme1' with parent '/' already present!
[  148.803244] nvme1: failed to create debugfs attr
[  148.877304] debugfs: Directory 'nvme1' with parent '/' already present!
[  148.877775] nvme1: failed to create debugfs attr
[  149.816652] debugfs: Directory 'nvme1' with parent '/' already present!
[  149.818011] nvme1: failed to create debugfs attr

Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Tested-by: Yi Zhang <yi.zhang@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-06-12 10:33:02 -07:00
Chaitanya Kulkarni
99c2dcc8ff nvme-core: fix memory leak in dhchap_ctrl_secret
Free dhchap_secret in nvme_ctrl_dhchap_ctrl_secret_store() before we
return when nvme_auth_generate_key() returns error.

Fixes: f50fff73d6 ("nvme: implement In-Band authentication")
Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-06-12 10:33:02 -07:00
Chaitanya Kulkarni
a836ca33c5 nvme-core: fix memory leak in dhchap_secret_store
Free dhchap_secret in nvme_ctrl_dhchap_secret_store() before we return
fix following kmemleack:-

unreferenced object 0xffff8886376ea800 (size 64):
  comm "check", pid 22048, jiffies 4344316705 (age 92.199s)
  hex dump (first 32 bytes):
    44 48 48 43 2d 31 3a 30 30 3a 6e 78 72 35 4b 67  DHHC-1:00:nxr5Kg
    75 58 34 75 6f 41 78 73 4a 61 34 63 2f 68 75 4c  uX4uoAxsJa4c/huL
  backtrace:
    [<0000000030ce5d4b>] __kmalloc+0x4b/0x130
    [<000000009be1cdc1>] nvme_ctrl_dhchap_secret_store+0x8f/0x160 [nvme_core]
    [<00000000ac06c96a>] kernfs_fop_write_iter+0x12b/0x1c0
    [<00000000437e7ced>] vfs_write+0x2ba/0x3c0
    [<00000000f9491baf>] ksys_write+0x5f/0xe0
    [<000000001c46513d>] do_syscall_64+0x3b/0x90
    [<00000000ecf348fe>] entry_SYSCALL_64_after_hwframe+0x72/0xdc
unreferenced object 0xffff8886376eaf00 (size 64):
  comm "check", pid 22048, jiffies 4344316736 (age 92.168s)
  hex dump (first 32 bytes):
    44 48 48 43 2d 31 3a 30 30 3a 6e 78 72 35 4b 67  DHHC-1:00:nxr5Kg
    75 58 34 75 6f 41 78 73 4a 61 34 63 2f 68 75 4c  uX4uoAxsJa4c/huL
  backtrace:
    [<0000000030ce5d4b>] __kmalloc+0x4b/0x130
    [<000000009be1cdc1>] nvme_ctrl_dhchap_secret_store+0x8f/0x160 [nvme_core]
    [<00000000ac06c96a>] kernfs_fop_write_iter+0x12b/0x1c0
    [<00000000437e7ced>] vfs_write+0x2ba/0x3c0
    [<00000000f9491baf>] ksys_write+0x5f/0xe0
    [<000000001c46513d>] do_syscall_64+0x3b/0x90
    [<00000000ecf348fe>] entry_SYSCALL_64_after_hwframe+0x72/0xdc

Fixes: f50fff73d6 ("nvme: implement In-Band authentication")
Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Tested-by: Yi Zhang <yi.zhang@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-06-12 10:33:02 -07:00