Commit Graph

5066 Commits

Author SHA1 Message Date
Mikulas Patocka
586d02c147 dm integrity: use kvfree for kvmalloc'd memory
commit fc8cec1139 upstream.

Use kvfree instead of kfree because the array is allocated with kvmalloc.

Fixes: 7eada909bf ("dm: add integrity target")
Cc: stable@vger.kernel.org # v4.12+
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-05-16 10:10:27 +02:00
Tang Junhui
c4c9fd5589 bcache: return attach error when no cache set exist
[ Upstream commit 7f4fc93d47 ]

I attach a back-end device to a cache set, and the cache set is not
registered yet, this back-end device did not attach successfully, and no
error returned:
[root]# echo 87859280-fec6-4bcc-20df7ca8f86b > /sys/block/sde/bcache/attach
[root]#

In sysfs_attach(), the return value "v" is initialized to "size" in
the beginning, and if no cache set exist in bch_cache_sets, the "v" value
would not change any more, and return to sysfs, sysfs regard it as success
since the "size" is a positive number.

This patch fixes this issue by assigning "v" with "-ENOENT" in the
initialization.

Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn>
Reviewed-by: Michael Lyle <mlyle@lyle.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-04-26 11:02:18 +02:00
Tang Junhui
4c8e0270dc bcache: fix for data collapse after re-attaching an attached device
[ Upstream commit 73ac105be3 ]

back-end device sdm has already attached a cache_set with ID
f67ebe1f-f8bc-4d73-bfe5-9dc88607f119, then try to attach with
another cache set, and it returns with an error:
[root]# cd /sys/block/sdm/bcache
[root]# echo 5ccd0a63-148e-48b8-afa2-aca9cbd6279f > attach
-bash: echo: write error: Invalid argument

After that, execute a command to modify the label of bcache
device:
[root]# echo data_disk1 > label

Then we reboot the system, when the system power on, the back-end
device can not attach to cache_set, a messages show in the log:
Feb  5 12:05:52 ceph152 kernel: [922385.508498] bcache:
bch_cached_dev_attach() couldn't find uuid for sdm in set

In sysfs_attach(), dc->sb.set_uuid was assigned to the value
which input through sysfs, no matter whether it is success
or not in bch_cached_dev_attach(). For example, If the back-end
device has already attached to an cache set, bch_cached_dev_attach()
would fail, but dc->sb.set_uuid was changed. Then modify the
label of bcache device, it will call bch_write_bdev_super(),
which would write the dc->sb.set_uuid to the super block, so we
record a wrong cache set ID in the super block, after the system
reboot, the cache set couldn't find the uuid of the back-end
device, so the bcache device couldn't exist and use any more.

In this patch, we don't assigned cache set ID to dc->sb.set_uuid
in sysfs_attach() directly, but input it into bch_cached_dev_attach(),
and assigned dc->sb.set_uuid to the cache set ID after the back-end
device attached to the cache set successful.

Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn>
Reviewed-by: Michael Lyle <mlyle@lyle.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-04-26 11:02:18 +02:00
Tang Junhui
311e31419b bcache: fix for allocator and register thread race
[ Upstream commit 682811b3ce ]

After long time running of random small IO writing,
I reboot the machine, and after the machine power on,
I found bcache got stuck, the stack is:
[root@ceph153 ~]# cat /proc/2510/task/*/stack
[<ffffffffa06b2455>] closure_sync+0x25/0x90 [bcache]
[<ffffffffa06b6be8>] bch_journal+0x118/0x2b0 [bcache]
[<ffffffffa06b6dc7>] bch_journal_meta+0x47/0x70 [bcache]
[<ffffffffa06be8f7>] bch_prio_write+0x237/0x340 [bcache]
[<ffffffffa06a8018>] bch_allocator_thread+0x3c8/0x3d0 [bcache]
[<ffffffff810a631f>] kthread+0xcf/0xe0
[<ffffffff8164c318>] ret_from_fork+0x58/0x90
[<ffffffffffffffff>] 0xffffffffffffffff
[root@ceph153 ~]# cat /proc/2038/task/*/stack
[<ffffffffa06b1abd>] __bch_btree_map_nodes+0x12d/0x150 [bcache]
[<ffffffffa06b1bd1>] bch_btree_insert+0xf1/0x170 [bcache]
[<ffffffffa06b637f>] bch_journal_replay+0x13f/0x230 [bcache]
[<ffffffffa06c75fe>] run_cache_set+0x79a/0x7c2 [bcache]
[<ffffffffa06c0cf8>] register_bcache+0xd48/0x1310 [bcache]
[<ffffffff812f702f>] kobj_attr_store+0xf/0x20
[<ffffffff8125b216>] sysfs_write_file+0xc6/0x140
[<ffffffff811dfbfd>] vfs_write+0xbd/0x1e0
[<ffffffff811e069f>] SyS_write+0x7f/0xe0
[<ffffffff8164c3c9>] system_call_fastpath+0x16/0x1
The stack shows the register thread and allocator thread
were getting stuck when registering cache device.

I reboot the machine several times, the issue always
exsit in this machine.

I debug the code, and found the call trace as bellow:
register_bcache()
   ==>run_cache_set()
      ==>bch_journal_replay()
         ==>bch_btree_insert()
            ==>__bch_btree_map_nodes()
               ==>btree_insert_fn()
                  ==>btree_split() //node need split
                     ==>btree_check_reserve()
In btree_check_reserve(), It will check if there is enough buckets
of RESERVE_BTREE type, since allocator thread did not work yet, so
no buckets of RESERVE_BTREE type allocated, so the register thread
waits on c->btree_cache_wait, and goes to sleep.

Then the allocator thread initialized, the call trace is bellow:
bch_allocator_thread()
==>bch_prio_write()
   ==>bch_journal_meta()
      ==>bch_journal()
         ==>journal_wait_for_write()
In journal_wait_for_write(), It will check if journal is full by
journal_full(), but the long time random small IO writing
causes the exhaustion of journal buckets(journal.blocks_free=0),
In order to release the journal buckets,
the allocator calls btree_flush_write() to flush keys to
btree nodes, and waits on c->journal.wait until btree nodes writing
over or there has already some journal buckets space, then the
allocator thread goes to sleep. but in btree_flush_write(), since
bch_journal_replay() is not finished, so no btree nodes have journal
(condition "if (btree_current_write(b)->journal)" never satisfied),
so we got no btree node to flush, no journal bucket released,
and allocator sleep all the times.

Through the above analysis, we can see that:
1) Register thread wait for allocator thread to allocate buckets of
   RESERVE_BTREE type;
2) Alloctor thread wait for register thread to replay journal, so it
   can flush btree nodes and get journal bucket.
   then they are all got stuck by waiting for each other.

Hua Rui provided a patch for me, by allocating some buckets of
RESERVE_BTREE type in advance, so the register thread can get bucket
when btree node splitting and no need to waiting for the allocator
thread. I tested it, it has effect, and register thread run a step
forward, but finally are still got stuck, the reason is only 8 bucket
of RESERVE_BTREE type were allocated, and in bch_journal_replay(),
after 2 btree nodes splitting, only 4 bucket of RESERVE_BTREE type left,
then btree_check_reserve() is not satisfied anymore, so it goes to sleep
again, and in the same time, alloctor thread did not flush enough btree
nodes to release a journal bucket, so they all got stuck again.

So we need to allocate more buckets of RESERVE_BTREE type in advance,
but how much is enough?  By experience and test, I think it should be
as much as journal buckets. Then I modify the code as this patch,
and test in the machine, and it works.

This patch modified base on Hua Rui’s patch, and allocate more buckets
of RESERVE_BTREE type in advance to avoid register thread and allocate
thread going to wait for each other.

[patch v2] ca->sb.njournal_buckets would be 0 in the first time after
cache creation, and no journal exists, so just 8 btree buckets is OK.

Signed-off-by: Hua Rui <huarui.dev@gmail.com>
Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn>
Reviewed-by: Michael Lyle <mlyle@lyle.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-04-26 11:02:18 +02:00
Coly Li
f89edd17af bcache: properly set task state in bch_writeback_thread()
[ Upstream commit 99361bbf26 ]

Kernel thread routine bch_writeback_thread() has the following code block,

447         down_write(&dc->writeback_lock);
448~450     if (check conditions) {
451                 up_write(&dc->writeback_lock);
452                 set_current_state(TASK_INTERRUPTIBLE);
453
454                 if (kthread_should_stop())
455                         return 0;
456
457                 schedule();
458                 continue;
459         }

If condition check is true, its task state is set to TASK_INTERRUPTIBLE
and call schedule() to wait for others to wake up it.

There are 2 issues in current code,
1, Task state is set to TASK_INTERRUPTIBLE after the condition checks, if
   another process changes the condition and call wake_up_process(dc->
   writeback_thread), then at line 452 task state is set back to
   TASK_INTERRUPTIBLE, the writeback kernel thread will lose a chance to be
   waken up.
2, At line 454 if kthread_should_stop() is true, writeback kernel thread
   will return to kernel/kthread.c:kthread() with TASK_INTERRUPTIBLE and
   call do_exit(). It is not good to enter do_exit() with task state
   TASK_INTERRUPTIBLE, in following code path might_sleep() is called and a
   warning message is reported by __might_sleep(): "WARNING: do not call
   blocking ops when !TASK_RUNNING; state=1 set at [xxxx]".

For the first issue, task state should be set before condition checks.
Ineed because dc->writeback_lock is required when modifying all the
conditions, calling set_current_state() inside code block where dc->
writeback_lock is hold is safe. But this is quite implicit, so I still move
set_current_state() before all the condition checks.

For the second issue, frankley speaking it does not hurt when kernel thread
exits with TASK_INTERRUPTIBLE state, but this warning message scares users,
makes them feel there might be something risky with bcache and hurt their
data.  Setting task state to TASK_RUNNING before returning fixes this
problem.

In alloc.c:allocator_wait(), there is also a similar issue, and is also
fixed in this patch.

Changelog:
v3: merge two similar fixes into one patch
v2: fix the race issue in v1 patch.
v1: initial buggy fix.

Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Michael Lyle <mlyle@lyle.org>
Cc: Michael Lyle <mlyle@lyle.org>
Cc: Junhui Tang <tang.junhui@zte.com.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-04-26 11:02:18 +02:00
Ming Lei
2e102fe86e dm mpath: return DM_MAPIO_REQUEUE on blk-mq rq allocation failure
[ Upstream commit 050af08ffb ]

blk-mq will rerun queue via RESTART or dispatch wake after one request
is completed, so not necessary to wait random time for requeuing, we
should trust blk-mq to do it.

More importantly, we need to return BLK_STS_RESOURCE to blk-mq so that
dequeuing from the I/O scheduler can be stopped, this results in
improved I/O merging.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-04-26 11:02:07 +02:00
Mikulas Patocka
79fbd052ea dm crypt: limit the number of allocated pages
commit 5059353df8 upstream.

dm-crypt consumes an excessive amount memory when the user attempts to
zero a dm-crypt device with "blkdiscard -z". The command "blkdiscard -z"
calls the BLKZEROOUT ioctl, it goes to the function __blkdev_issue_zeroout,
__blkdev_issue_zeroout sends a large amount of write bios that contain
the zero page as their payload.

For each incoming page, dm-crypt allocates another page that holds the
encrypted data, so when processing "blkdiscard -z", dm-crypt tries to
allocate the amount of memory that is equal to the size of the device.
This can trigger OOM killer or cause system crash.

Fix this by limiting the amount of memory that dm-crypt allocates to 2%
of total system memory. This limit is system-wide and is divided by the
number of active dm-crypt devices and each device receives an equal
share.

Cc: stable@vger.kernel.org
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-04-24 09:36:31 +02:00
Tang Junhui
fad9bcb117 bcache: segregate flash only volume write streams
[ Upstream commit 4eca1cb28d ]

In such scenario that there are some flash only volumes
, and some cached devices, when many tasks request these devices in
writeback mode, the write IOs may fall to the same bucket as bellow:
| cached data | flash data | cached data | cached data| flash data|
then after writeback of these cached devices, the bucket would
be like bellow bucket:
| free | flash data | free | free | flash data |

So, there are many free space in this bucket, but since data of flash
only volumes still exists, so this bucket cannot be reclaimable,
which would cause waste of bucket space.

In this patch, we segregate flash only volume write streams from
cached devices, so data from flash only volumes and cached devices
can store in different buckets.

Compare to v1 patch, this patch do not add a additionally open bucket
list, and it is try best to segregate flash only volume write streams
from cached devices, sectors of flash only volumes may still be mixed
with dirty sectors of cached device, but the number is very small.

[mlyle: fixed commit log formatting, permissions, line endings]

Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn>
Reviewed-by: Michael Lyle <mlyle@lyle.org>
Signed-off-by: Michael Lyle <mlyle@lyle.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-04-12 12:32:18 +02:00
Tang Junhui
ef60904109 bcache: stop writeback thread after detaching
[ Upstream commit 8d29c4426b ]

Currently, when a cached device detaching from cache, writeback thread is
not stopped, and writeback_rate_update work is not canceled. For example,
after the following command:
echo 1 >/sys/block/sdb/bcache/detach
you can still see the writeback thread. Then you attach the device to the
cache again, bcache will create another writeback thread, for example,
after below command:
echo  ba0fb5cd-658a-4533-9806-6ce166d883b9 > /sys/block/sdb/bcache/attach
then you will see 2 writeback threads.
This patch stops writeback thread and cancels writeback_rate_update work
when cached device detaching from cache.

Compare with patch v1, this v2 patch moves code down into the register
lock for safety in case of any future changes as Coly and Mike suggested.

[edit by mlyle: commit log spelling/formatting]

Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn>
Reviewed-by: Michael Lyle <mlyle@lyle.org>
Signed-off-by: Michael Lyle <mlyle@lyle.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-04-12 12:32:18 +02:00
Rui Hua
71468ce63d bcache: ret IOERR when read meets metadata error
[ Upstream commit b221fc130c ]

The read request might meet error when searching the btree, but the error
was not handled in cache_lookup(), and this kind of metadata failure will
not go into cached_dev_read_error(), finally, the upper layer will receive
bi_status=0.  In this patch we judge the metadata error by the return
value of bch_btree_map_keys(), there are two potential paths give rise to
the error:

1. Because the btree is not totally cached in memery, we maybe get error
   when read btree node from cache device (see bch_btree_node_get()), the
   likely errno is -EIO, -ENOMEM

2. When read miss happens, bch_btree_insert_check_key() will be called to
   insert a "replace_key" to btree(see cached_dev_cache_miss(), just for
   doing preparatory work before insert the missed data to cache device),
   a failure can also happen in this situation, the likely errno is
   -ENOMEM

bch_btree_map_keys() will return MAP_DONE in normal scenario, but we will
get either -EIO or -ENOMEM in above two cases. if this happened, we should
NOT recover data from backing device (when cache device is dirty) because
we don't know whether bkeys the read request covered are all clean.  And
after that happened, s->iop.status is still its initially value(0) before
we submit s->bio.bio, we set it to BLK_STS_IOERR, so it can go into
cached_dev_read_error(), and finally it can be passed to upper layer, or
recovered by reread from backing device.

[edit by mlyle: patch formatting, word-wrap, comment spelling,
commit log format]

Signed-off-by: Hua Rui <huarui.dev@gmail.com>
Reviewed-by: Michael Lyle <mlyle@lyle.org>
Signed-off-by: Michael Lyle <mlyle@lyle.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-04-12 12:32:18 +02:00
Heinz Mauelshagen
8834a75d8a dm raid: fix raid set size revalidation
[ Upstream commit 61e06e2c3e ]

The raid set size is being revalidated unconditionally before a
reshaping conversion is started.  MD requires the size to only be
reduced in case of a stripe removing (i.e. shrinking) reshape but not
when growing because the raid array has to stay small until after the
growing reshape finishes.

Fix by avoiding the size revalidation in preresume unless a shrinking
reshape is requested.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-19 08:42:54 +01:00
Steffen Maier
d916e45394 dm mpath: fix passing integrity data
commit 8c5c147339 upstream.

After v4.12 commit e2460f2a4b ("dm: mark targets that pass integrity
data"), dm-multipath, e.g. on DIF+DIX SCSI disk paths, does not support
block integrity any more. So add it to the whitelist.

This is also a pre-requisite to use block integrity with other dm layer(s)
on top of multipath, such as kpartx partitions (dm-linear) or LVM.

Also, bump target version to reflect this fix.

Fixes: e2460f2a4b ("dm: mark targets that pass integrity data")
Cc: <stable@vger.kernel.org> #4.12+
Bisected-by: Fedor Loshakov <loshakov@linux.vnet.ibm.com>
Signed-off-by: Steffen Maier <maier@linux.vnet.ibm.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-19 08:42:47 +01:00
Michael Lyle
c6a1c0caf2 bcache: don't attach backing with duplicate UUID
commit 86755b7a96 upstream.

This can happen e.g. during disk cloning.

This is an incomplete fix: it does not catch duplicate UUIDs earlier
when things are still unattached.  It does not unregister the device.
Further changes to cope better with this are planned but conflict with
Coly's ongoing improvements to handling device errors.  In the meantime,
one can manually stop the device after this has happened.

Attempts to attach a duplicate device result in:

[  136.372404] loop: module loaded
[  136.424461] bcache: register_bdev() registered backing device loop0
[  136.424464] bcache: bch_cached_dev_attach() Tried to attach loop0 but duplicate UUID already attached

My test procedure is:

  dd if=/dev/sdb1 of=imgfile bs=1024 count=262144
  losetup -f imgfile

Signed-off-by: Michael Lyle <mlyle@lyle.org>
Reviewed-by: Tang Junhui <tang.junhui@zte.com.cn>
Cc: <stable@vger.kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-15 10:54:33 +01:00
Tang Junhui
14c2230b83 bcache: fix crashes in duplicate cache device register
commit cc40daf91b upstream.

Kernel crashed when register a duplicate cache device, the call trace is
bellow:
[  417.643790] CPU: 1 PID: 16886 Comm: bcache-register Tainted: G
   W  OE    4.15.5-amd64-preempt-sysrq-20171018 #2
[  417.643861] Hardware name: LENOVO 20ERCTO1WW/20ERCTO1WW, BIOS
N1DET41W (1.15 ) 12/31/2015
[  417.643870] RIP: 0010:bdevname+0x13/0x1e
[  417.643876] RSP: 0018:ffffa3aa9138fd38 EFLAGS: 00010282
[  417.643884] RAX: 0000000000000000 RBX: ffff8c8f2f2f8000 RCX: ffffd6701f8
c7edf
[  417.643890] RDX: ffffa3aa9138fd88 RSI: ffffa3aa9138fd88 RDI: 00000000000
00000
[  417.643895] RBP: ffffa3aa9138fde0 R08: ffffa3aa9138fae8 R09: 00000000000
1850e
[  417.643901] R10: ffff8c8eed34b271 R11: ffff8c8eed34b250 R12: 00000000000
00000
[  417.643906] R13: ffffd6701f78f940 R14: ffff8c8f38f80000 R15: ffff8c8ea7d
90000
[  417.643913] FS:  00007fde7e66f500(0000) GS:ffff8c8f61440000(0000) knlGS:
0000000000000000
[  417.643919] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  417.643925] CR2: 0000000000000314 CR3: 00000007e6fa0001 CR4: 00000000003
606e0
[  417.643931] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 00000000000
00000
[  417.643938] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 00000000000
00400
[  417.643946] Call Trace:
[  417.643978]  register_bcache+0x1117/0x1270 [bcache]
[  417.643994]  ? slab_pre_alloc_hook+0x15/0x3c
[  417.644001]  ? slab_post_alloc_hook.isra.44+0xa/0x1a
[  417.644013]  ? kernfs_fop_write+0xf6/0x138
[  417.644020]  kernfs_fop_write+0xf6/0x138
[  417.644031]  __vfs_write+0x31/0xcc
[  417.644043]  ? current_kernel_time64+0x10/0x36
[  417.644115]  ? __audit_syscall_entry+0xbf/0xe3
[  417.644124]  vfs_write+0xa5/0xe2
[  417.644133]  SyS_write+0x5c/0x9f
[  417.644144]  do_syscall_64+0x72/0x81
[  417.644161]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[  417.644169] RIP: 0033:0x7fde7e1c1974
[  417.644175] RSP: 002b:00007fff13009a38 EFLAGS: 00000246 ORIG_RAX: 0000000
000000001
[  417.644183] RAX: ffffffffffffffda RBX: 0000000001658280 RCX: 00007fde7e1c
1974
[  417.644188] RDX: 000000000000000a RSI: 0000000001658280 RDI: 000000000000
0001
[  417.644193] RBP: 000000000000000a R08: 0000000000000003 R09: 000000000000
0077
[  417.644198] R10: 000000000000089e R11: 0000000000000246 R12: 000000000000
0001
[  417.644203] R13: 000000000000000a R14: 7fffffffffffffff R15: 000000000000
0000
[  417.644213] Code: c7 c2 83 6f ee 98 be 20 00 00 00 48 89 df e8 6c 27 3b 0
0 48 89 d8 5b c3 0f 1f 44 00 00 48 8b 47 70 48 89 f2 48 8b bf 80 00 00 00 <8
b> b0 14 03 00 00 e9 73 ff ff ff 0f 1f 44 00 00 48 8b 47 40 39
[  417.644302] RIP: bdevname+0x13/0x1e RSP: ffffa3aa9138fd38
[  417.644306] CR2: 0000000000000314

When registering duplicate cache device in register_cache(), after failure
on calling register_cache_set(), bch_cache_release() will be called, then
bdev will be freed, so bdevname(bdev, name) caused kernel crash.

Since bch_cache_release() will free bdev, so in this patch we make sure
bdev being freed if register_cache() fail, and do not free bdev again in
register_bcache() when register_cache() fail.

Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn>
Reported-by: Marc MERLIN <marc@merlins.org>
Tested-by: Michael Lyle <mlyle@lyle.org>
Reviewed-by: Michael Lyle <mlyle@lyle.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-15 10:54:33 +01:00
Arnd Bergmann
525673339f dm bufio: avoid false-positive Wmaybe-uninitialized warning
commit 590347e400 upstream.

gcc-6.3 and earlier show a new warning after a seemingly unrelated
change to the arm64 PAGE_KERNEL definition:

In file included from drivers/md/dm-bufio.c:14:0:
drivers/md/dm-bufio.c: In function 'alloc_buffer':
include/linux/sched/mm.h:182:56: warning: 'noio_flag' may be used uninitialized in this function [-Wmaybe-uninitialized]
  current->flags = (current->flags & ~PF_MEMALLOC_NOIO) | flags;
                                                        ^

The same warning happened earlier on linux-3.18 for MIPS and I did a
workaround for that, but now it's come back.

gcc-7 and newer are apparently smart enough to figure this out, and
other architectures don't show it, so the best I could come up with is
to rework the caller slightly in a way that makes it obvious enough to
all arm64 compilers what is happening here.

Fixes: 41acec6240 ("arm64: kpti: Make use of nG dependent on arm64_kernel_unmapped_at_el0()")
Link: https://patchwork.kernel.org/patch/9692829/
Cc: stable@vger.kernel.org
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
[snitzer: moved declarations inside conditional, altered vmalloc return]
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-15 10:54:33 +01:00
NeilBrown
9474d8fa7a md: only allow remove_and_add_spares when no sync_thread running.
commit 39772f0a7b upstream.

The locking protocols in md assume that a device will
never be removed from an array during resync/recovery/reshape.
When that isn't happening, rcu or reconfig_mutex is needed
to protect an rdev pointer while taking a refcount.  When
it is happening, that protection isn't needed.

Unfortunately there are cases were remove_and_add_spares() is
called when recovery might be happening: is state_store(),
slot_store() and hot_remove_disk().
In each case, this is just an optimization, to try to expedite
removal from the personality so the device can be removed from
the array.  If resync etc is happening, we just have to wait
for md_check_recover to find a suitable time to call
remove_and_add_spares().

This optimization and not essential so it doesn't
matter if it fails.
So change remove_and_add_spares() to abort early if
resync/recovery/reshape is happening, unless it is called
from md_check_recovery() as part of a newly started recovery.
The parameter "this" is only NULL when called from
md_check_recovery() so when it is NULL, there is no need to abort.

As this can result in a NULL dereference, the fix is suitable
for -stable.

cc: yuyufen <yuyufen@huawei.com>
Cc: Tomasz Majchrzak <tomasz.majchrzak@intel.com>
Fixes: 8430e7e0af ("md: disconnect device from personality before trying to remove it.")
Cc: stable@ver.kernel.org (v4.8+)
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <sh.li@alibaba-inc.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-08 22:41:26 -08:00
Shaohua Li
567c1f767e md/raid1/10: add missed blk plug
[ Upstream commit 18022a1bd3 ]

flush_pending_writes isn't always called with block plug, so add it, and plug
works in nested way.

Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-02-25 11:07:57 +01:00
bingjingc
559be170a4 md/raid5: correct degraded calculation in raid5_error
[ Upstream commit aff69d89bd ]

When disk failure occurs on new disks for reshape, mddev->degraded
is not calculated correctly. Faulty bit of the failure device is not
set before raid5_calc_degraded(conf).

mdadm --create /dev/md0 --level=5 --raid-devices=3 /dev/loop[012]
mdadm /dev/md0 -a /dev/loop3
mdadm /dev/md0 --grow -n4
mdadm /dev/md0 -f /dev/loop3 # simulating disk failure

cat /sys/block/md0/md/degraded # it outputs 0, but it should be 1.

However, mdadm -D /dev/md0 will show that it is degraded. It's a bug.
It can be fixed by moving the resources raid5_calc_degraded() depends
on before it.

Reported-by: Roy Chung <roychung@synology.com>
Reviewed-by: Alex Wu <alexwu@synology.com>
Signed-off-by: BingJing Chang <bingjingc@synology.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-02-25 11:07:57 +01:00
NeilBrown
f6318abd3a dm: correctly handle chained bios in dec_pending()
commit 8dd601fa83 upstream.

dec_pending() is given an error status (possibly 0) to be recorded
against a bio.  It can be called several times on the one 'struct
dm_io', and it is careful to only assign a non-zero error to
io->status.  However when it then assigned io->status to bio->bi_status,
it is not careful and could overwrite a genuine error status with 0.

This can happen when chained bios are in use.  If a bio is chained
beneath the bio that this dm_io is handling, the child bio might
complete and set bio->bi_status before the dm_io completes.

This has been possible since chained bios were introduced in 3.14, and
has become a lot easier to trigger with commit 18a25da843 ("dm: ensure
bio submission follows a depth-first tree walk") as that commit caused
dm to start using chained bios itself.

A particular failure mode is that if a bio spans an 'error' target and a
working target, the 'error' fragment will complete instantly and set the
->bi_status, and the other fragment will normally complete a little
later, and will clear ->bi_status.

The fix is simply to only assign io_error to bio->bi_status when
io_error is not zero.

Reported-and-tested-by: Milan Broz <gmazyland@gmail.com>
Cc: stable@vger.kernel.org (v3.14+)
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-02-22 15:42:29 +01:00
Michael Lyle
f9f175778b bcache: check return value of register_shrinker
[ Upstream commit 6c4ca1e36c ]

register_shrinker is now __must_check, so check it to kill a warning.
Caller of bch_btree_cache_alloc in super.c appropriately checks return
value so this is fully plumbed through.

This V2 fixes checkpatch warnings and improves the commit description,
as I was too hasty getting the previous version out.

Signed-off-by: Michael Lyle <mlyle@lyle.org>
Reviewed-by: Vojtech Pavlik <vojtech@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-02-03 17:39:01 +01:00
Wei Yongjun
f05431489f dm crypt: fix error return code in crypt_ctr()
commit 3cc2e57c4b upstream.

Fix to return error code -ENOMEM from the mempool_create_kmalloc_pool()
error handling case instead of 0, as done elsewhere in this function.

Fixes: ef43aa3806 ("dm crypt: add cryptographic data integrity protection (authenticated encryption)")
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-23 19:58:19 +01:00
Ondrej Kozina
82e12f1b26 dm crypt: wipe kernel key copy after IV initialization
commit dc94902bde upstream.

Loading key via kernel keyring service erases the internal
key copy immediately after we pass it in crypto layer. This is
wrong because IV is initialized later and we use wrong key
for the initialization (instead of real key there's just zeroed
block).

The bug may cause data corruption if key is loaded via kernel keyring
service first and later same crypt device is reactivated using exactly
same key in hexbyte representation, or vice versa. The bug (and fix)
affects only ciphers using following IVs: essiv, lmk and tcw.

Fixes: c538f6ec9f ("dm crypt: add ability to use keys from the kernel key retention service")
Signed-off-by: Ondrej Kozina <okozina@redhat.com>
Reviewed-by: Milan Broz <gmazyland@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-23 19:58:19 +01:00
Milan Broz
6b8fdeadeb dm crypt: fix crash by adding missing check for auth key size
commit 27c7003697 upstream.

If dm-crypt uses authenticated mode with separate MAC, there are two
concatenated part of the key structure - key(s) for encryption and
authentication key.

Add a missing check for authenticated key length.  If this key length is
smaller than actually provided key, dm-crypt now properly fails instead
of crashing.

Fixes: ef43aa3806 ("dm crypt: add cryptographic data integrity protection (authenticated encryption)")
Reported-by: Salah Coronya <salahx@yahoo.com>
Signed-off-by: Milan Broz <gmazyland@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-23 19:58:19 +01:00
Mikulas Patocka
c808f1db59 dm integrity: don't store cipher request on the stack
commit 717f4b1c52 upstream.

Some asynchronous cipher implementations may use DMA.  The stack may
be mapped in the vmalloc area that doesn't support DMA.  Therefore,
the cipher request and initialization vector shouldn't be on the
stack.

Fix this by allocating the request and iv with kmalloc.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-23 19:58:19 +01:00
Dennis Yang
5f12c33849 dm thin metadata: THIN_MAX_CONCURRENT_LOCKS should be 6
commit 490ae017f5 upstream.

For btree removal, there is a corner case that a single thread
could takes 6 locks which is more than THIN_MAX_CONCURRENT_LOCKS(5)
and leads to deadlock.

A btree removal might eventually call
rebalance_children()->rebalance3() to rebalance entries of three
neighbor child nodes when shadow_spine has already acquired two
write locks. In rebalance3(), it tries to shadow and acquire the
write locks of all three child nodes. However, shadowing a child
node requires acquiring a read lock of the original child node and
a write lock of the new block. Although the read lock will be
released after block shadowing, shadowing the third child node
in rebalance3() could still take the sixth lock.
(2 write locks for shadow_spine +
 2 write locks for the first two child nodes's shadow +
 1 write lock for the last child node's shadow +
 1 read lock for the last child node)

Signed-off-by: Dennis Yang <dennisyang@qnap.com>
Acked-by: Joe Thornber <thornber@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-23 19:58:18 +01:00
Joe Thornber
905637024f dm btree: fix serious bug in btree_split_beneath()
commit bc68d0a435 upstream.

When inserting a new key/value pair into a btree we walk down the spine of
btree nodes performing the following 2 operations:

  i) space for a new entry
  ii) adjusting the first key entry if the new key is lower than any in the node.

If the _root_ node is full, the function btree_split_beneath() allocates 2 new
nodes, and redistibutes the root nodes entries between them.  The root node is
left with 2 entries corresponding to the 2 new nodes.

btree_split_beneath() then adjusts the spine to point to one of the two new
children.  This means the first key is never adjusted if the new key was lower,
ie. operation (ii) gets missed out.  This can result in the new key being
'lost' for a period; until another low valued key is inserted that will uncover
it.

This is a serious bug, and quite hard to make trigger in normal use.  A
reproducing test case ("thin create devices-in-reverse-order") is
available as part of the thin-provision-tools project:
  https://github.com/jthornber/thin-provisioning-tools/blob/master/functional-tests/device-mapper/dm-tests.scm#L593

Fix the issue by changing btree_split_beneath() so it no longer adjusts
the spine.  Instead it unlocks both the new nodes, and lets the main
loop in btree_insert_raw() relock the appropriate one and make any
neccessary adjustments.

Reported-by: Monty Pavel <monty_pavel@sina.com>
Signed-off-by: Joe Thornber <thornber@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-23 19:58:18 +01:00
Suren Baghdasaryan
883a082e8b dm bufio: fix shrinker scans when (nr_to_scan < retain_target)
commit fbc7c07ec2 upstream.

When system is under memory pressure it is observed that dm bufio
shrinker often reclaims only one buffer per scan. This change fixes
the following two issues in dm bufio shrinker that cause this behavior:

1. ((nr_to_scan - freed) <= retain_target) condition is used to
terminate slab scan process. This assumes that nr_to_scan is equal
to the LRU size, which might not be correct because do_shrink_slab()
in vmscan.c calculates nr_to_scan using multiple inputs.
As a result when nr_to_scan is less than retain_target (64) the scan
will terminate after the first iteration, effectively reclaiming one
buffer per scan and making scans very inefficient. This hurts vmscan
performance especially because mutex is acquired/released every time
dm_bufio_shrink_scan() is called.
New implementation uses ((LRU size - freed) <= retain_target)
condition for scan termination. LRU size can be safely determined
inside __scan() because this function is called after dm_bufio_lock().

2. do_shrink_slab() uses value returned by dm_bufio_shrink_count() to
determine number of freeable objects in the slab. However dm_bufio
always retains retain_target buffers in its LRU and will terminate
a scan when this mark is reached. Therefore returning the entire LRU size
from dm_bufio_shrink_count() is misleading because that does not
represent the number of freeable objects that slab will reclaim during
a scan. Returning (LRU size - retain_target) better represents the
number of freeable objects in the slab. This way do_shrink_slab()
returns 0 when (LRU size < retain_target) and vmscan will not try to
scan this shrinker avoiding scans that will not reclaim any memory.

Test: tested using Android device running
<AOSP>/system/extras/alloc-stress that generates memory pressure
and causes intensive shrinker scans

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-17 09:45:17 +01:00
Guoqing Jiang
6d7bdad132 md: always set THREAD_WAKEUP and wake up wqueue if thread existed
[ Upstream commit d1d90147c9 ]

Since commit 4ad23a9764 ("MD: use per-cpu counter for writes_pending"),
the wait_queue is only got invoked if THREAD_WAKEUP is not set previously.

With above change, I can see process_metadata_update could always hang on
the wait queue, because mddev->thread could stay on 'D' status and the
THREAD_WAKEUP flag is not cleared since there are lots of place to wake up
mddev->thread. Then deadlock happened as follows:

linux175:~ # ps aux|grep md|grep D
root    20117   0.0 0.0         0   0 ? D   03:45   0:00 [md0_raid1]
root    20125   0.0 0.0         0   0 ? D   03:45   0:00 [md0_cluster_rec]
linux175:~ # cat /proc/20117/stack
[<ffffffffa0635604>] dlm_lock_sync+0x94/0xd0 [md_cluster]
[<ffffffffa0635674>] lock_token+0x34/0xd0 [md_cluster]
[<ffffffffa0635804>] metadata_update_start+0x64/0x110 [md_cluster]
[<ffffffffa04d985b>] md_update_sb.part.58+0x9b/0x860 [md_mod]
[<ffffffffa04da035>] md_update_sb+0x15/0x30 [md_mod]
[<ffffffffa04dc066>] md_check_recovery+0x266/0x490 [md_mod]
[<ffffffffa06450e2>] raid1d+0x42/0x810 [raid1]
[<ffffffffa04d2252>] md_thread+0x122/0x150 [md_mod]
[<ffffffff81091741>] kthread+0x101/0x140
linux175:~ # cat /proc/20125/stack
[<ffffffffa0636679>] recv_daemon+0x3f9/0x5c0 [md_cluster]
[<ffffffffa04d2252>] md_thread+0x122/0x150 [md_mod]
[<ffffffff81091741>] kthread+0x101/0x140

So let's revert the part of code in the commit to resovle the problem since
we can't get lots of benefits of previous change.

Fixes: 4ad23a9764 ("MD: use per-cpu counter for writes_pending")
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-12-25 14:26:26 +01:00
Will Deacon
5383f45db3 locking/barriers: Convert users of lockless_dereference() to READ_ONCE()
commit 3382290ed2 upstream.

[ Note, this is a Git cherry-pick of the following commit:

    506458efaf ("locking/barriers: Convert users of lockless_dereference() to READ_ONCE()")

  ... for easier x86 PTI code testing and back-porting. ]

READ_ONCE() now has an implicit smp_read_barrier_depends() call, so it
can be used instead of lockless_dereference() without any change in
semantics.

Signed-off-by: Will Deacon <will.deacon@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1508840570-22169-4-git-send-email-will.deacon@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-12-25 14:26:21 +01:00
Guoqing Jiang
46788d19e1 md-cluster: fix wrong condition check in raid1_write_request
[ Upstream commit 385f4d7f94 ]

The check used here is to avoid conflict between write and
resync, however we used the wrong logic, it should be the
inverse of the checking inside "if".

Fixes: 589a1c4 ("Suspend writes in RAID1 if within range")
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-12-20 10:10:36 +01:00
Artur Paszkiewicz
339aba6798 raid5-ppl: check recovery_offset when performing ppl recovery
[ Upstream commit 07719ff767 ]

If starting an array that is undergoing rebuild, make ppl recovery honor
the recovery_offset of a member disk and don't read data that is not yet
in-sync.

Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-12-20 10:10:36 +01:00
NeilBrown
8151940acc raid5: Set R5_Expanded on parity devices as well as data.
[ Upstream commit 235b6003fb ]

When reshaping a fully degraded raid5/raid6 to a larger
nubmer of devices, the new device(s) are not in-sync
and so that can make the newly grown stripe appear to be
"failed".
To avoid this, we set the R5_Expanded flag to say "Even though
this device is not fully in-sync, this block is safe so
don't treat the device as failed for this stripe".
This flag is set for data devices, not not for parity devices.

Consequently, if you have a RAID6 with two devices that are partly
recovered and a spare, and start a reshape to include the spare,
then when the reshape gets past the point where the recovery was
up to, it will think the stripes are failed and will get into
an infinite loop, failing to make progress.

So when contructing parity on an EXPAND_READY stripe,
set R5_Expanded.

Reported-by: Curt <lightspd@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-12-20 10:10:34 +01:00
tang.junhui
d5860344c2 bcache: fix wrong cache_misses statistics
[ Upstream commit c157313791 ]

Currently, Cache missed IOs are identified by s->cache_miss, but actually,
there are many situations that missed IOs are not assigned a value for
s->cache_miss in cached_dev_cache_miss(), for example, a bypassed IO
(s->iop.bypass = 1), or the cache_bio allocate failed. In these situations,
it will go to out_put or out_submit, and s->cache_miss is null, which leads
bch_mark_cache_accounting() to treat this IO as a hit IO.

[ML: applied by 3-way merge]

Signed-off-by: tang.junhui <tang.junhui@zte.com.cn>
Reviewed-by: Michael Lyle <mlyle@lyle.org>
Reviewed-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-12-20 10:10:29 +01:00
Liang Chen
c9973b3d35 bcache: explicitly destroy mutex while exiting
[ Upstream commit 330a4db89d ]

mutex_destroy does nothing most of time, but it's better to call
it to make the code future proof and it also has some meaning
for like mutex debug.

As Coly pointed out in a previous review, bcache_exit() may not be
able to handle all the references properly if userspace registers
cache and backing devices right before bch_debug_init runs and
bch_debug_init failes later. So not exposing userspace interface
until everything is ready to avoid that issue.

Signed-off-by: Liang Chen <liangchen.linux@gmail.com>
Reviewed-by: Michael Lyle <mlyle@lyle.org>
Reviewed-by: Coly Li <colyli@suse.de>
Reviewed-by: Eric Wheeler <bcache@linux.ewheeler.net>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-12-20 10:10:29 +01:00
monty_pavel@sina.com
fbce429b41 dm: fix various targets to dm_register_target after module __init resources created
commit 7e6358d244 upstream.

A NULL pointer is seen if two concurrent "vgchange -ay -K <vg name>"
processes race to load the dm-thin-pool module:

 PID: 25992 TASK: ffff883cd7d23500 CPU: 4 COMMAND: "vgchange"
  #0 [ffff883cd743d600] machine_kexec at ffffffff81038fa9
  0000001 [ffff883cd743d660] crash_kexec at ffffffff810c5992
  0000002 [ffff883cd743d730] oops_end at ffffffff81515c90
  0000003 [ffff883cd743d760] no_context at ffffffff81049f1b
  0000004 [ffff883cd743d7b0] __bad_area_nosemaphore at ffffffff8104a1a5
  0000005 [ffff883cd743d800] bad_area at ffffffff8104a2ce
  0000006 [ffff883cd743d830] __do_page_fault at ffffffff8104aa6f
  0000007 [ffff883cd743d950] do_page_fault at ffffffff81517bae
  0000008 [ffff883cd743d980] page_fault at ffffffff81514f95
     [exception RIP: kmem_cache_alloc+108]
     RIP: ffffffff8116ef3c RSP: ffff883cd743da38 RFLAGS: 00010046
     RAX: 0000000000000004 RBX: ffffffff81121b90 RCX: ffff881bf1e78cc0
     RDX: 0000000000000000 RSI: 00000000000000d0 RDI: 0000000000000000
     RBP: ffff883cd743da68 R8: ffff881bf1a4eb00 R9: 0000000080042000
     R10: 0000000000002000 R11: 0000000000000000 R12: 00000000000000d0
     R13: 0000000000000000 R14: 00000000000000d0 R15: 0000000000000246
     ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
  0000009 [ffff883cd743da70] mempool_alloc_slab at ffffffff81121ba5
 0000010 [ffff883cd743da80] mempool_create_node at ffffffff81122083
 0000011 [ffff883cd743dad0] mempool_create at ffffffff811220f4
 0000012 [ffff883cd743dae0] pool_ctr at ffffffffa08de049 [dm_thin_pool]
 0000013 [ffff883cd743dbd0] dm_table_add_target at ffffffffa0005f2f [dm_mod]
 0000014 [ffff883cd743dc30] table_load at ffffffffa0008ba9 [dm_mod]
 0000015 [ffff883cd743dc90] ctl_ioctl at ffffffffa0009dc4 [dm_mod]

The race results in a NULL pointer because:

Process A (vgchange -ay -K):
 	a. send DM_LIST_VERSIONS_CMD ioctl;
 	b. pool_target not registered;
 	c. modprobe dm_thin_pool and wait until end.

Process B (vgchange -ay -K):
 	a. send DM_LIST_VERSIONS_CMD ioctl;
 	b. pool_target registered;
 	c. table_load->dm_table_add_target->pool_ctr;
 	d. _new_mapping_cache is NULL and panic.
Note:
 	1. process A and process B are two concurrent processes.
 	2. pool_target can be detected by process B but
 	_new_mapping_cache initialization has not ended.

To fix dm-thin-pool, and other targets (cache, multipath, and snapshot)
with the same problem, simply dm_register_target() after all resources
created during module init (as labelled with __init) are finished.

Signed-off-by: monty <monty_pavel@sina.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-12-20 10:10:21 +01:00
Zdenek Kabelac
89a459e0d5 md: free unused memory after bitmap resize
[ Upstream commit 0868b99c21 ]

When bitmap is resized, the old kalloced chunks just are not released
once the resized bitmap starts to use new space.

This fixes in particular kmemleak reports like this one:

unreferenced object 0xffff8f4311e9c000 (size 4096):
  comm "lvm", pid 19333, jiffies 4295263268 (age 528.265s)
  hex dump (first 32 bytes):
    02 80 02 80 02 80 02 80 02 80 02 80 02 80 02 80  ................
    02 80 02 80 02 80 02 80 02 80 02 80 02 80 02 80  ................
  backtrace:
    [<ffffffffa69471ca>] kmemleak_alloc+0x4a/0xa0
    [<ffffffffa628c10e>] kmem_cache_alloc_trace+0x14e/0x2e0
    [<ffffffffa676cfec>] bitmap_checkpage+0x7c/0x110
    [<ffffffffa676d0c5>] bitmap_get_counter+0x45/0xd0
    [<ffffffffa676d6b3>] bitmap_set_memory_bits+0x43/0xe0
    [<ffffffffa676e41c>] bitmap_init_from_disk+0x23c/0x530
    [<ffffffffa676f1ae>] bitmap_load+0xbe/0x160
    [<ffffffffc04c47d3>] raid_preresume+0x203/0x2f0 [dm_raid]
    [<ffffffffa677762f>] dm_table_resume_targets+0x4f/0xe0
    [<ffffffffa6774b52>] dm_resume+0x122/0x140
    [<ffffffffa6779b9f>] dev_suspend+0x18f/0x290
    [<ffffffffa677a3a7>] ctl_ioctl+0x287/0x560
    [<ffffffffa677a693>] dm_ctl_ioctl+0x13/0x20
    [<ffffffffa62d6b46>] do_vfs_ioctl+0xa6/0x750
    [<ffffffffa62d7269>] SyS_ioctl+0x79/0x90
    [<ffffffffa6956d41>] entry_SYSCALL_64_fastpath+0x1f/0xc2

Signed-off-by: Zdenek Kabelac <zkabelac@redhat.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-12-17 15:08:00 +01:00
Heinz Mauelshagen
2c727856d0 dm raid: fix panic when attempting to force a raid to sync
[ Upstream commit 2339784490 ]

Requesting a sync on an active raid device via a table reload
(see 'sync' parameter in Documentation/device-mapper/dm-raid.txt)
skips the super_load() call that defines the superblock size
(rdev->sb_size) -- resulting in an oops if/when super_sync()->memset()
is called.

Fix by moving the initialization of the superblock start and size
out of super_load() to the caller (analyse_superblocks).

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-12-17 15:08:00 +01:00
Song Liu
55b26ae24c md/r5cache: move mddev_lock() out of r5c_journal_mode_set()
commit ff35f58e8f upstream.

r5c_journal_mode_set() is called by r5c_journal_mode_store() and
raid_ctr() in dm-raid. We don't need mddev_lock() when calling from
raid_ctr(). This patch fixes this by moves the mddev_lock() to
r5c_journal_mode_store().

Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-12-14 09:52:59 +01:00
NeilBrown
a67936842b md: forbid a RAID5 from having both a bitmap and a journal.
commit 230b55fa8d upstream.

Having both a bitmap and a journal is pointless.
Attempting to do so can corrupt the bitmap if the journal
replay happens before the bitmap is initialized.
Rather than try to avoid this corruption, simply
refuse to allow arrays with both a bitmap and a journal.
So:
 - if raid5_run sees both are present, fail.
 - if adding a bitmap finds a journal is present, fail
 - if adding a journal finds a bitmap is present, fail.

Signed-off-by: NeilBrown <neilb@suse.com>
Tested-by: Joshua Kinard <kumba@gentoo.org>
Acked-by: Joshua Kinard <kumba@gentoo.org>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-12-05 11:26:38 +01:00
Rui Hua
1cafc451a9 bcache: recover data from backing when data is clean
commit e393aa2446 upstream.

When we send a read request and hit the clean data in cache device, there
is a situation called cache read race in bcache(see the commit in the tail
of cache_look_up(), the following explaination just copy from there):
The bucket we're reading from might be reused while our bio is in flight,
and we could then end up reading the wrong data. We guard against this
by checking (in bch_cache_read_endio()) if the pointer is stale again;
if so, we treat it as an error (s->iop.error = -EINTR) and reread from
the backing device (but we don't pass that error up anywhere)

It should be noted that cache read race happened under normal
circumstances, not the circumstance when SSD failed, it was counted
and shown in  /sys/fs/bcache/XXX/internal/cache_read_races.

Without this patch, when we use writeback mode, we will never reread from
the backing device when cache read race happened, until the whole cache
device is clean, because the condition
(s->recoverable && (dc && !atomic_read(&dc->has_dirty))) is false in
cached_dev_read_error(). In this situation, the s->iop.error(= -EINTR)
will be passed up, at last, user will receive -EINTR when it's bio end,
this is not suitable, and wield to up-application.

In this patch, we use s->read_dirty_data to judge whether the read
request hit dirty data in cache device, it is safe to reread data from
the backing device when the read request hit clean data. This can not
only handle cache read race, but also recover data when failed read
request from cache device.

[edited by mlyle to fix up whitespace, commit log title, comment
spelling]

Fixes: d59b237959 ("bcache: only permit to recovery read error when cache device is clean")
Signed-off-by: Hua Rui <huarui.dev@gmail.com>
Reviewed-by: Michael Lyle <mlyle@lyle.org>
Reviewed-by: Coly Li <colyli@suse.de>
Signed-off-by: Michael Lyle <mlyle@lyle.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-12-05 11:26:34 +01:00
Coly Li
260c6625c8 bcache: only permit to recovery read error when cache device is clean
commit d59b237959 upstream.

When bcache does read I/Os, for example in writeback or writethrough mode,
if a read request on cache device is failed, bcache will try to recovery
the request by reading from cached device. If the data on cached device is
not synced with cache device, then requester will get a stale data.

For critical storage system like database, providing stale data from
recovery may result an application level data corruption, which is
unacceptible.

With this patch, for a failed read request in writeback or writethrough
mode, recovery a recoverable read request only happens when cache device
is clean. That is to say, all data on cached device is up to update.

For other cache modes in bcache, read request will never hit
cached_dev_read_error(), they don't need this patch.

Please note, because cache mode can be switched arbitrarily in run time, a
writethrough mode might be switched from a writeback mode. Therefore
checking dc->has_data in writethrough mode still makes sense.

Changelog:
V4: Fix parens error pointed by Michael Lyle.
v3: By response from Kent Oversteet, he thinks recovering stale data is a
    bug to fix, and option to permit it is unnecessary. So this version
    the sysfs file is removed.
v2: rename sysfs entry from allow_stale_data_on_failure  to
    allow_stale_data_on_failure, and fix the confusing commit log.
v1: initial patch posted.

[small change to patch comment spelling by mlyle]

Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Michael Lyle <mlyle@lyle.org>
Reported-by: Arne Wolf <awolf@lenovo.com>
Reviewed-by: Michael Lyle <mlyle@lyle.org>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Cc: Nix <nix@esperi.org.uk>
Cc: Kai Krakow <hurikhan77@gmail.com>
Cc: Eric Wheeler <bcache@lists.ewheeler.net>
Cc: Junhui Tang <tang.junhui@zte.com.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-12-05 11:26:34 +01:00
Huacai Chen
f5b41b3662 bcache: Fix building error on MIPS
commit cf33c1ee52 upstream.

This patch try to fix the building error on MIPS. The reason is MIPS
has already defined the PTR macro, which conflicts with the PTR macro
in include/uapi/linux/bcache.h.

[fixed by mlyle: corrected a line-length issue]

Signed-off-by: Huacai Chen <chenhc@lemote.com>
Reviewed-by: Michael Lyle <mlyle@lyle.org>
Signed-off-by: Michael Lyle <mlyle@lyle.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-12-05 11:26:34 +01:00
Nate Dailey
d92105f93a raid1: prevent freeze_array/wait_all_barriers deadlock
commit f6eca2d43e upstream.

If freeze_array is attempted in the middle of close_sync/
wait_all_barriers, deadlock can occur.

freeze_array will wait for nr_pending and nr_queued to line up.
wait_all_barriers increments nr_pending for each barrier bucket, one
at a time, but doesn't actually issue IO that could be counted in
nr_queued. So freeze_array is blocked until wait_all_barriers
completes and allow_all_barriers runs. At the same time, when
_wait_barrier sees array_frozen == 1, it stops and waits for
freeze_array to complete.

Prevent the deadlock by making close_sync call _wait_barrier and
_allow_barrier for one bucket at a time, instead of deferring the
_allow_barrier calls until after all _wait_barriers are complete.

Signed-off-by: Nate Dailey <nate.dailey@stratus.com>
Fix: fd76863e37fe(RAID1: a new I/O barrier implementation to remove resync window)
Reviewed-by: Coly Li <colyli@suse.de>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-30 08:40:52 +00:00
Shaohua Li
a6ff2fb417 md/bitmap: revert a patch
commit 938b533d47 upstream.

This reverts commit 8031c3ddc7. That patches doesn't work well if PAGE_SIZE >
4k. We will fix the original problem with a different approach.

Fix: 8031c3ddc70a(md/bitmap: copy correct data for bitmap super)
Reported-by: Joshua Kinard <kumba@gentoo.org>
Suggested-by: Neil Brown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-30 08:40:47 +00:00
Artur Paszkiewicz
7cd7a7aa01 md: don't check MD_SB_CHANGE_CLEAN in md_allow_write
commit b90f6ff080 upstream.

Only MD_SB_CHANGE_PENDING should be used to wait for transition from
clean to dirty. Checking also MD_SB_CHANGE_CLEAN is unnecessary and can
race with e.g. md_do_sync(). This sporadically causes a hang when
changing consistency policy during resync:

INFO: task mdadm:6183 blocked for more than 30 seconds.
      Not tainted 4.14.0-rc3+ #391
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
mdadm           D12752  6183   6022 0x00000000
Call Trace:
 __schedule+0x93f/0x990
 schedule+0x6b/0x90
 md_allow_write+0x100/0x130 [md_mod]
 ? do_wait_intr_irq+0x90/0x90
 resize_stripes+0x3a/0x5b0 [raid456]
 ? kernfs_fop_write+0xbe/0x180
 raid5_change_consistency_policy+0xa6/0x200 [raid456]
 consistency_policy_store+0x2e/0x70 [md_mod]
 md_attr_store+0x90/0xc0 [md_mod]
 sysfs_kf_write+0x42/0x50
 kernfs_fop_write+0x119/0x180
 __vfs_write+0x28/0x110
 ? rcu_sync_lockdep_assert+0x12/0x60
 ? __sb_start_write+0x15a/0x1c0
 ? vfs_write+0xa3/0x1a0
 vfs_write+0xb4/0x1a0
 SyS_write+0x49/0xa0
 entry_SYSCALL_64_fastpath+0x18/0xad

Fixes: 2214c260c7 ("md: don't return -EAGAIN in md_allow_write for external metadata arrays")
Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-30 08:40:47 +00:00
NeilBrown
459aad50a5 md: fix deadlock error in recent patch.
commit d47c8ad261 upstream.

A recent patch aimed to cause md_write_start() to fail (rather than
block) when the mddev was suspending, so as to avoid deadlocks.
Unfortunately the test in wait_event() was wrong, and it didn't change
behaviour at all.

We wait_event() must wait until the metadata is written OR the array is
suspending.

Fixes: cc27b0c78c ("md: fix deadlock between mddev_suspend() and md_write_start()")
Reported-by: Xiao Ni <xni@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-30 08:40:47 +00:00
Coly Li
e9c80881b3 bcache: check ca->alloc_thread initialized before wake up it
commit 91af8300d9 upstream.

In bcache code, sysfs entries are created before all resources get
allocated, e.g. allocation thread of a cache set.

There is posibility for NULL pointer deference if a resource is accessed
but which is not initialized yet. Indeed Jorg Bornschein catches one on
cache set allocation thread and gets a kernel oops.

The reason for this bug is, when bch_bucket_alloc() is called during
cache set registration and attaching, ca->alloc_thread is not properly
allocated and initialized yet, call wake_up_process() on ca->alloc_thread
triggers NULL pointer deference failure. A simple and fast fix is, before
waking up ca->alloc_thread, checking whether it is allocated, and only
wake up ca->alloc_thread when it is not NULL.

Signed-off-by: Coly Li <colyli@suse.de>
Reported-by: Jorg Bornschein <jb@capsec.org>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Reviewed-by: Michael Lyle <mlyle@lyle.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-30 08:40:45 +00:00
Mike Snitzer
e39516d24f dm: discard support requires all targets in a table support discards
commit 8a74d29d54 upstream.

A DM device with a mix of discard capabilities (due to some underlying
devices not having discard support) _should_ just return -EOPNOTSUPP for
the region of the device that doesn't support discards (even if only by
way of the underlying driver formally not supporting discards).  BUT,
that does ask the underlying driver to handle something that it never
advertised support for.  In doing so we're exposing users to the
potential for a underlying disk driver hanging if/when a discard is
issued a the device that is incapable and never claimed to support
discards.

Fix this by requiring that each DM target in a DM table provide discard
support as a prereq for a DM device to advertise support for discards.

This may cause some configurations that were happily supporting discards
(even in the face of a mix of discard support) to stop supporting
discards -- but the risk of users hitting driver hangs, and forced
reboots, outweighs supporting those fringe mixed discard
configurations.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-30 08:40:43 +00:00
Hou Tao
3bfb87ecb4 dm: fix race between dm_get_from_kobject() and __dm_destroy()
commit b9a41d21dc upstream.

The following BUG_ON was hit when testing repeat creation and removal of
DM devices:

    kernel BUG at drivers/md/dm.c:2919!
    CPU: 7 PID: 750 Comm: systemd-udevd Not tainted 4.1.44
    Call Trace:
     [<ffffffff81649e8b>] dm_get_from_kobject+0x34/0x3a
     [<ffffffff81650ef1>] dm_attr_show+0x2b/0x5e
     [<ffffffff817b46d1>] ? mutex_lock+0x26/0x44
     [<ffffffff811df7f5>] sysfs_kf_seq_show+0x83/0xcf
     [<ffffffff811de257>] kernfs_seq_show+0x23/0x25
     [<ffffffff81199118>] seq_read+0x16f/0x325
     [<ffffffff811de994>] kernfs_fop_read+0x3a/0x13f
     [<ffffffff8117b625>] __vfs_read+0x26/0x9d
     [<ffffffff8130eb59>] ? security_file_permission+0x3c/0x44
     [<ffffffff8117bdb8>] ? rw_verify_area+0x83/0xd9
     [<ffffffff8117be9d>] vfs_read+0x8f/0xcf
     [<ffffffff81193e34>] ? __fdget_pos+0x12/0x41
     [<ffffffff8117c686>] SyS_read+0x4b/0x76
     [<ffffffff817b606e>] system_call_fastpath+0x12/0x71

The bug can be easily triggered, if an extra delay (e.g. 10ms) is added
between the test of DMF_FREEING & DMF_DELETING and dm_get() in
dm_get_from_kobject().

To fix it, we need to ensure the test of DMF_FREEING & DMF_DELETING and
dm_get() are done in an atomic way, so _minor_lock is used.

The other callers of dm_get() have also been checked to be OK: some
callers invoke dm_get() under _minor_lock, some callers invoke it under
_hash_lock, and dm_start_request() invoke it after increasing
md->open_count.

Signed-off-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-30 08:40:43 +00:00
Mikulas Patocka
2bf483c9a4 dm: allocate struct mapped_device with kvzalloc
commit 856eb0916d upstream.

The structure srcu_struct can be very big, its size is proportional to the
value CONFIG_NR_CPUS. The Fedora kernel has CONFIG_NR_CPUS 8192, the field
io_barrier in the struct mapped_device has 84kB in the debugging kernel
and 50kB in the non-debugging kernel. The large size may result in failure
of the function kzalloc_node.

In order to avoid the allocation failure, we use the function
kvzalloc_node, this function falls back to vmalloc if a large contiguous
chunk of memory is not available. This patch also moves the field
io_barrier to the last position of struct mapped_device - the reason is
that on many processor architectures, short memory offsets result in
smaller code than long memory offsets - on x86-64 it reduces code size by
320 bytes.

Note to stable kernel maintainers - the kernels 4.11 and older don't have
the function kvzalloc_node, you can use the function vzalloc_node instead.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-30 08:40:43 +00:00