Commit Graph

79471 Commits

Author SHA1 Message Date
Alex Shi
2bf7955152 Merge tag 'v4.4.8' into linux-linaro-lsk-v4.4
This is the 4.4.8 stable release
2016-04-21 12:06:25 +08:00
Bjorn Helgaas
2d0d0011ff Revert "PCI: Add helpers to manage pci_dev->irq and pci_dev->irq_managed"
commit 67b4eab91c upstream.

Revert 811a4e6fce ("PCI: Add helpers to manage pci_dev->irq and
pci_dev->irq_managed").

This is part of reverting 991de2e590 ("PCI, x86: Implement
pcibios_alloc_irq() and pcibios_free_irq()") to fix regressions it
introduced.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=111211
Fixes: 991de2e590 ("PCI, x86: Implement pcibios_alloc_irq() and pcibios_free_irq()")
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Acked-by: Rafael J. Wysocki <rafael@kernel.org>
CC: Jiang Liu <jiang.liu@linux.intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-04-20 15:42:16 +09:00
Miklos Szeredi
c452dfc332 fs: add file_dentry()
commit d101a12595 upstream.

This series fixes bugs in nfs and ext4 due to 4bacc9c923 ("overlayfs:
Make f_path always point to the overlay and f_inode to the underlay").

Regular files opened on overlayfs will result in the file being opened on
the underlying filesystem, while f_path points to the overlayfs
mount/dentry.

This confuses filesystems which get the dentry from struct file and assume
it's theirs.

Add a new helper, file_dentry() [*], to get the filesystem's own dentry
from the file.  This checks file->f_path.dentry->d_flags against
DCACHE_OP_REAL, and returns file->f_path.dentry if DCACHE_OP_REAL is not
set (this is the common, non-overlayfs case).

In the uncommon case it will call into overlayfs's ->d_real() to get the
underlying dentry, matching file_inode(file).

The reason we need to check against the inode is that if the file is copied
up while being open, d_real() would return the upper dentry, while the open
file comes from the lower dentry.

[*] If possible, it's better simply to use file_inode() instead.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Tested-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: Trond Myklebust <trond.myklebust@primarydata.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Daniel Axtens <dja@axtens.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-04-20 15:42:12 +09:00
Hans de Goede
f9a6b3cadd USB: uas: Add a new NO_REPORT_LUNS quirk
commit 1363074667 upstream.

Add a new NO_REPORT_LUNS quirk and set it for Seagate drives with
an usb-id of: 0bc2:331a, as these will fail to respond to a
REPORT_LUNS command.

Reported-and-tested-by: David Webb <djw@noc.ac.uk>
Signed-off-by: Hans de Goede <hdegoede@redhat.com>
Acked-by: Alan Stern <stern@rowland.harvard.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-04-20 15:42:07 +09:00
Daniel Borkmann
e137eeb38d tun, bpf: fix suspicious RCU usage in tun_{attach, detach}_filter
[ Upstream commit 5a5abb1fa3 ]

Sasha Levin reported a suspicious rcu_dereference_protected() warning
found while fuzzing with trinity that is similar to this one:

  [   52.765684] net/core/filter.c:2262 suspicious rcu_dereference_protected() usage!
  [   52.765688] other info that might help us debug this:
  [   52.765695] rcu_scheduler_active = 1, debug_locks = 1
  [   52.765701] 1 lock held by a.out/1525:
  [   52.765704]  #0:  (rtnl_mutex){+.+.+.}, at: [<ffffffff816a64b7>] rtnl_lock+0x17/0x20
  [   52.765721] stack backtrace:
  [   52.765728] CPU: 1 PID: 1525 Comm: a.out Not tainted 4.5.0+ #264
  [...]
  [   52.765768] Call Trace:
  [   52.765775]  [<ffffffff813e488d>] dump_stack+0x85/0xc8
  [   52.765784]  [<ffffffff810f2fa5>] lockdep_rcu_suspicious+0xd5/0x110
  [   52.765792]  [<ffffffff816afdc2>] sk_detach_filter+0x82/0x90
  [   52.765801]  [<ffffffffa0883425>] tun_detach_filter+0x35/0x90 [tun]
  [   52.765810]  [<ffffffffa0884ed4>] __tun_chr_ioctl+0x354/0x1130 [tun]
  [   52.765818]  [<ffffffff8136fed0>] ? selinux_file_ioctl+0x130/0x210
  [   52.765827]  [<ffffffffa0885ce3>] tun_chr_ioctl+0x13/0x20 [tun]
  [   52.765834]  [<ffffffff81260ea6>] do_vfs_ioctl+0x96/0x690
  [   52.765843]  [<ffffffff81364af3>] ? security_file_ioctl+0x43/0x60
  [   52.765850]  [<ffffffff81261519>] SyS_ioctl+0x79/0x90
  [   52.765858]  [<ffffffff81003ba2>] do_syscall_64+0x62/0x140
  [   52.765866]  [<ffffffff817d563f>] entry_SYSCALL64_slow_path+0x25/0x25

Same can be triggered with PROVE_RCU (+ PROVE_RCU_REPEATEDLY) enabled
from tun_attach_filter() when user space calls ioctl(tun_fd, TUN{ATTACH,
DETACH}FILTER, ...) for adding/removing a BPF filter on tap devices.

Since the fix in f91ff5b9ff ("net: sk_{detach|attach}_filter() rcu
fixes") sk_attach_filter()/sk_detach_filter() now dereferences the
filter with rcu_dereference_protected(), checking whether socket lock
is held in control path.

Since its introduction in 9940516259 ("tun: socket filter support"),
tap filters are managed under RTNL lock from __tun_chr_ioctl(). Thus the
sock_owned_by_user(sk) doesn't apply in this specific case and therefore
triggers the false positive.

Extend the BPF API with __sk_attach_filter()/__sk_detach_filter() pair
that is used by tap filters and pass in lockdep_rtnl_is_held() for the
rcu_dereference_protected() checks instead.

Reported-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-04-20 15:42:06 +09:00
Eric Dumazet
8178211eb7 bonding: fix bond_get_stats()
[ Upstream commit fe30937b65 ]

bond_get_stats() can be called from rtnetlink (with RTNL held)
or from /proc/net/dev seq handler (with RCU held)

The logic added in commit 5f0c5f73e5 ("bonding: make global bonding
stats more reliable") kind of assumed only one cpu could run there.

If multiple threads are reading /proc/net/dev, stats can be really
messed up after a while.

A second problem is that some fields are 32bit, so we need to properly
handle the wrap around problem.

Given that RTNL is not always held, we need to use
bond_for_each_slave_rcu().

Fixes: 5f0c5f73e5 ("bonding: make global bonding stats more reliable")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Andy Gospodarek <gospo@cumulusnetworks.com>
Cc: Jay Vosburgh <j.vosburgh@gmail.com>
Cc: Veaceslav Falico <vfalico@gmail.com>
Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-04-20 15:42:04 +09:00
Stephen Hemminger
acbea202fb bridge: allow zero ageing time
[ Upstream commit 4c656c13b2 ]

This fixes a regression in the bridge ageing time caused by:
commit c62987bbd8 ("bridge: push bridge setting ageing_time down to switchdev")

There are users of Linux bridge which use the feature that if ageing time
is set to 0 it causes entries to never expire. See:
  https://www.linuxfoundation.org/collaborate/workgroups/networking/bridge

For a pure software bridge, it is unnecessary for the code to have
arbitrary restrictions on what values are allowable.

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-04-20 15:42:02 +09:00
Willem de Bruijn
8b8d278aa4 net: validate variable length ll headers
[ Upstream commit 2793a23aac ]

Netdevice parameter hard_header_len is variously interpreted both as
an upper and lower bound on link layer header length. The field is
used as upper bound when reserving room at allocation, as lower bound
when validating user input in PF_PACKET.

Clarify the definition to be maximum header length. For validation
of untrusted headers, add an optional validate member to header_ops.

Allow bypassing of validation by passing CAP_SYS_RAWIO, for instance
for deliberate testing of corrupt input. In this case, pad trailing
bytes, as some device drivers expect completely initialized headers.

See also http://comments.gmane.org/gmane.linux.network/401064

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-04-20 15:42:00 +09:00
Benjamin Poirier
d9bbdcd83d mld, igmp: Fix reserved tailroom calculation
[ Upstream commit 1837b2e2bc ]

The current reserved_tailroom calculation fails to take hlen and tlen into
account.

skb:
[__hlen__|__data____________|__tlen___|__extra__]
^                                               ^
head                                            skb_end_offset

In this representation, hlen + data + tlen is the size passed to alloc_skb.
"extra" is the extra space made available in __alloc_skb because of
rounding up by kmalloc. We can reorder the representation like so:

[__hlen__|__data____________|__extra__|__tlen___]
^                                               ^
head                                            skb_end_offset

The maximum space available for ip headers and payload without
fragmentation is min(mtu, data + extra). Therefore,
reserved_tailroom
= data + extra + tlen - min(mtu, data + extra)
= skb_end_offset - hlen - min(mtu, skb_end_offset - hlen - tlen)
= skb_tailroom - min(mtu, skb_tailroom - tlen) ; after skb_reserve(hlen)

Compare the second line to the current expression:
reserved_tailroom = skb_end_offset - min(mtu, skb_end_offset)
and we can see that hlen and tlen are not taken into account.

The min() in the third line can be expanded into:
if mtu < skb_tailroom - tlen:
	reserved_tailroom = skb_tailroom - mtu
else:
	reserved_tailroom = tlen

Depending on hlen, tlen, mtu and the number of multicast address records,
the current code may output skbs that have less tailroom than
dev->needed_tailroom or it may output more skbs than needed because not all
space available is used.

Fixes: 4c672e4b ("ipv6: mld: fix add_grhead skb_over_panic for devs with large MTUs")
Signed-off-by: Benjamin Poirier <bpoirier@suse.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-04-20 15:41:58 +09:00
Linus Lüssing
44bc7d1b97 net: fix bridge multicast packet checksum validation
[ Upstream commit 9b368814b3 ]

We need to update the skb->csum after pulling the skb, otherwise
an unnecessary checksum (re)computation can ocure for IGMP/MLD packets
in the bridge code. Additionally this fixes the following splats for
network devices / bridge ports with support for and enabled RX checksum
offloading:

[...]
[   43.986968] eth0: hw csum failure
[   43.990344] CPU: 3 PID: 0 Comm: swapper/3 Not tainted 4.4.0 #2
[   43.996193] Hardware name: BCM2709
[   43.999647] [<800204e0>] (unwind_backtrace) from [<8001cf14>] (show_stack+0x10/0x14)
[   44.007432] [<8001cf14>] (show_stack) from [<801ab614>] (dump_stack+0x80/0x90)
[   44.014695] [<801ab614>] (dump_stack) from [<802e4548>] (__skb_checksum_complete+0x6c/0xac)
[   44.023090] [<802e4548>] (__skb_checksum_complete) from [<803a055c>] (ipv6_mc_validate_checksum+0x104/0x178)
[   44.032959] [<803a055c>] (ipv6_mc_validate_checksum) from [<802e111c>] (skb_checksum_trimmed+0x130/0x188)
[   44.042565] [<802e111c>] (skb_checksum_trimmed) from [<803a06e8>] (ipv6_mc_check_mld+0x118/0x338)
[   44.051501] [<803a06e8>] (ipv6_mc_check_mld) from [<803b2c98>] (br_multicast_rcv+0x5dc/0xd00)
[   44.060077] [<803b2c98>] (br_multicast_rcv) from [<803aa510>] (br_handle_frame_finish+0xac/0x51c)
[...]

Fixes: 9afd85c9e4 ("net: Export IGMP/MLD message validation code")
Reported-by: Álvaro Fernández Rojas <noltari@gmail.com>
Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-04-20 15:41:58 +09:00
Paolo Bonzini
d4429b81f6 compiler-gcc: disable -ftracer for __noclone functions
commit 95272c2937 upstream.

-ftracer can duplicate asm blocks causing compilation to fail in
noclone functions.  For example, KVM declares a global variable
in an asm like

    asm("2: ... \n
         .pushsection data \n
         .global vmx_return \n
         vmx_return: .long 2b");

and -ftracer causes a double declaration.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Marek <mmarek@suse.cz>
Cc: kvm@vger.kernel.org
Reported-by: Linda Walsh <lkml@tlinx.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-04-20 15:41:54 +09:00
Alex Shi
ad592b70ae Merge tag 'v4.4.7' into linux-linaro-lsk-v4.4
This is the 4.4.7 stable release
2016-04-13 12:02:17 +08:00
Peter Zijlstra
2bfb843538 bitops: Do not default to __clear_bit() for __clear_bit_unlock()
commit f75d48644c upstream.

__clear_bit_unlock() is a special little snowflake. While it carries the
non-atomic '__' prefix, it is specifically documented to pair with
test_and_set_bit() and therefore should be 'somewhat' atomic.

Therefore the generic implementation of __clear_bit_unlock() cannot use
the fully non-atomic __clear_bit() as a default.

If an arch is able to do better; is must provide an implementation of
__clear_bit_unlock() itself.

Specifically, this came up as a result of hackbench livelock'ing in
slab_lock() on ARC with SMP + SLUB + !LLSC.

The issue was incorrect pairing of atomic ops.

 slab_lock() -> bit_spin_lock() -> test_and_set_bit()
 slab_unlock() -> __bit_spin_unlock() -> __clear_bit()

The non serializing __clear_bit() was getting "lost"

 80543b8e:	ld_s       r2,[r13,0] <--- (A) Finds PG_locked is set
 80543b90:	or         r3,r2,1    <--- (B) other core unlocks right here
 80543b94:	st_s       r3,[r13,0] <--- (C) sets PG_locked (overwrites unlock)

Fixes ARC STAR 9000817404 (and probably more).

Reported-by: Vineet Gupta <Vineet.Gupta1@synopsys.com>
Tested-by: Vineet Gupta <Vineet.Gupta1@synopsys.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Helge Deller <deller@gmx.de>
Cc: James E.J. Bottomley <jejb@parisc-linux.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Noam Camus <noamc@ezchip.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20160309114054.GJ6356@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-04-12 09:09:00 -07:00
Steven Rostedt (Red Hat)
3dba3f672d tracing: Fix trace_printk() to print when not using bprintk()
commit 3debb0a9dd upstream.

The trace_printk() code will allocate extra buffers if the compile detects
that a trace_printk() is used. To do this, the format of the trace_printk()
is saved to the __trace_printk_fmt section, and if that section is bigger
than zero, the buffers are allocated (along with a message that this has
happened).

If trace_printk() uses a format that is not a constant, and thus something
not guaranteed to be around when the print happens, the compiler optimizes
the fmt out, as it is not used, and the __trace_printk_fmt section is not
filled. This means the kernel will not allocate the special buffers needed
for the trace_printk() and the trace_printk() will not write anything to the
tracing buffer.

Adding a "__used" to the variable in the __trace_printk_fmt section will
keep it around, even though it is set to NULL. This will keep the string
from being printed in the debugfs/tracing/printk_formats section as it is
not needed.

Reported-by: Vlastimil Babka <vbabka@suse.cz>
Fixes: 07d777fe8c "tracing: Add percpu buffers for trace_printk()"
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-04-12 09:09:00 -07:00
Jann Horn
74b23f79f1 fs/coredump: prevent fsuid=0 dumps into user-controlled directories
commit 378c6520e7 upstream.

This commit fixes the following security hole affecting systems where
all of the following conditions are fulfilled:

 - The fs.suid_dumpable sysctl is set to 2.
 - The kernel.core_pattern sysctl's value starts with "/". (Systems
   where kernel.core_pattern starts with "|/" are not affected.)
 - Unprivileged user namespace creation is permitted. (This is
   true on Linux >=3.8, but some distributions disallow it by
   default using a distro patch.)

Under these conditions, if a program executes under secure exec rules,
causing it to run with the SUID_DUMP_ROOT flag, then unshares its user
namespace, changes its root directory and crashes, the coredump will be
written using fsuid=0 and a path derived from kernel.core_pattern - but
this path is interpreted relative to the root directory of the process,
allowing the attacker to control where a coredump will be written with
root privileges.

To fix the security issue, always interpret core_pattern for dumps that
are written under SUID_DUMP_ROOT relative to the root directory of init.

Signed-off-by: Jann Horn <jann@thejh.net>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-04-12 09:08:58 -07:00
Tejun Heo
36591ef19a cgroup: ignore css_sets associated with dead cgroups during migration
commit 2b021cbf3c upstream.

Before 2e91fa7f6d ("cgroup: keep zombies associated with their
original cgroups"), all dead tasks were associated with init_css_set.
If a zombie task is requested for migration, while migration prep
operations would still be performed on init_css_set, the actual
migration would ignore zombie tasks.  As init_css_set is always valid,
this worked fine.

However, after 2e91fa7f6d, zombie tasks stay with the css_set it was
associated with at the time of death.  Let's say a task T associated
with cgroup A on hierarchy H-1 and cgroup B on hiearchy H-2.  After T
becomes a zombie, it would still remain associated with A and B.  If A
only contains zombie tasks, it can be removed.  On removal, A gets
marked offline but stays pinned until all zombies are drained.  At
this point, if migration is initiated on T to a cgroup C on hierarchy
H-2, migration path would try to prepare T's css_set for migration and
trigger the following.

 WARNING: CPU: 0 PID: 1576 at kernel/cgroup.c:474 cgroup_get+0x121/0x160()
 CPU: 0 PID: 1576 Comm: bash Not tainted 4.4.0-work+ #289
 ...
 Call Trace:
  [<ffffffff8127e63c>] dump_stack+0x4e/0x82
  [<ffffffff810445e8>] warn_slowpath_common+0x78/0xb0
  [<ffffffff810446d5>] warn_slowpath_null+0x15/0x20
  [<ffffffff810c33e1>] cgroup_get+0x121/0x160
  [<ffffffff810c349b>] link_css_set+0x7b/0x90
  [<ffffffff810c4fbc>] find_css_set+0x3bc/0x5e0
  [<ffffffff810c5269>] cgroup_migrate_prepare_dst+0x89/0x1f0
  [<ffffffff810c7547>] cgroup_attach_task+0x157/0x230
  [<ffffffff810c7a17>] __cgroup_procs_write+0x2b7/0x470
  [<ffffffff810c7bdc>] cgroup_tasks_write+0xc/0x10
  [<ffffffff810c4790>] cgroup_file_write+0x30/0x1b0
  [<ffffffff811c68fc>] kernfs_fop_write+0x13c/0x180
  [<ffffffff81151673>] __vfs_write+0x23/0xe0
  [<ffffffff81152494>] vfs_write+0xa4/0x1a0
  [<ffffffff811532d4>] SyS_write+0x44/0xa0
  [<ffffffff814af2d7>] entry_SYSCALL_64_fastpath+0x12/0x6f

It doesn't make sense to prepare migration for css_sets pointing to
dead cgroups as they are guaranteed to contain only zombies which are
ignored later during migration.  This patch makes cgroup destruction
path mark all affected css_sets as dead and updates the migration path
to ignore them during preparation.

Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: 2e91fa7f6d ("cgroup: keep zombies associated with their original cgroups")
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-04-12 09:08:54 -07:00
Peter Hurley
583aacb1f6 tty: Fix GPF in flush_to_ldisc(), part 2
commit f33798deec upstream.

commit 9ce119f318 ("tty: Fix GPF in flush_to_ldisc()") fixed a
GPF caused by a line discipline which does not define a receive_buf()
method.

However, the vt driver (and speakup driver also) pushes selection
data directly to the line discipline receive_buf() method via
tty_ldisc_receive_buf(). Fix the same problem in tty_ldisc_receive_buf().

Signed-off-by: Peter Hurley <peter@hurleysoftware.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-04-12 09:08:49 -07:00
DingXiang
15c3af026b dm snapshot: disallow the COW and origin devices from being identical
commit 4df2bf466a upstream.

Otherwise loading a "snapshot" table using the same device for the
origin and COW devices, e.g.:

echo "0 20971520 snapshot 253:3 253:3 P 8" | dmsetup create snap

will trigger:

BUG: unable to handle kernel NULL pointer dereference at 0000000000000098
[ 1958.979934] IP: [<ffffffffa040efba>] dm_exception_store_set_chunk_size+0x7a/0x110 [dm_snapshot]
[ 1958.989655] PGD 0
[ 1958.991903] Oops: 0000 [#1] SMP
...
[ 1959.059647] CPU: 9 PID: 3556 Comm: dmsetup Tainted: G          IO    4.5.0-rc5.snitm+ #150
...
[ 1959.083517] task: ffff8800b9660c80 ti: ffff88032a954000 task.ti: ffff88032a954000
[ 1959.091865] RIP: 0010:[<ffffffffa040efba>]  [<ffffffffa040efba>] dm_exception_store_set_chunk_size+0x7a/0x110 [dm_snapshot]
[ 1959.104295] RSP: 0018:ffff88032a957b30  EFLAGS: 00010246
[ 1959.110219] RAX: 0000000000000000 RBX: 0000000000000008 RCX: 0000000000000001
[ 1959.118180] RDX: 0000000000000000 RSI: 0000000000000008 RDI: ffff880329334a00
[ 1959.126141] RBP: ffff88032a957b50 R08: 0000000000000000 R09: 0000000000000001
[ 1959.134102] R10: 000000000000000a R11: f000000000000000 R12: ffff880330884d80
[ 1959.142061] R13: 0000000000000008 R14: ffffc90001c13088 R15: ffff880330884d80
[ 1959.150021] FS:  00007f8926ba3840(0000) GS:ffff880333440000(0000) knlGS:0000000000000000
[ 1959.159047] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1959.165456] CR2: 0000000000000098 CR3: 000000032f48b000 CR4: 00000000000006e0
[ 1959.173415] Stack:
[ 1959.175656]  ffffc90001c13040 ffff880329334a00 ffff880330884ed0 ffff88032a957bdc
[ 1959.183946]  ffff88032a957bb8 ffffffffa040f225 ffff880329334a30 ffff880300000000
[ 1959.192233]  ffffffffa04133e0 ffff880329334b30 0000000830884d58 00000000569c58cf
[ 1959.200521] Call Trace:
[ 1959.203248]  [<ffffffffa040f225>] dm_exception_store_create+0x1d5/0x240 [dm_snapshot]
[ 1959.211986]  [<ffffffffa040d310>] snapshot_ctr+0x140/0x630 [dm_snapshot]
[ 1959.219469]  [<ffffffffa0005c44>] ? dm_split_args+0x64/0x150 [dm_mod]
[ 1959.226656]  [<ffffffffa0005ea7>] dm_table_add_target+0x177/0x440 [dm_mod]
[ 1959.234328]  [<ffffffffa0009203>] table_load+0x143/0x370 [dm_mod]
[ 1959.241129]  [<ffffffffa00090c0>] ? retrieve_status+0x1b0/0x1b0 [dm_mod]
[ 1959.248607]  [<ffffffffa0009e35>] ctl_ioctl+0x255/0x4d0 [dm_mod]
[ 1959.255307]  [<ffffffff813304e2>] ? memzero_explicit+0x12/0x20
[ 1959.261816]  [<ffffffffa000a0c3>] dm_ctl_ioctl+0x13/0x20 [dm_mod]
[ 1959.268615]  [<ffffffff81215eb6>] do_vfs_ioctl+0xa6/0x5c0
[ 1959.274637]  [<ffffffff81120d2f>] ? __audit_syscall_entry+0xaf/0x100
[ 1959.281726]  [<ffffffff81003176>] ? do_audit_syscall_entry+0x66/0x70
[ 1959.288814]  [<ffffffff81216449>] SyS_ioctl+0x79/0x90
[ 1959.294450]  [<ffffffff8167e4ae>] entry_SYSCALL_64_fastpath+0x12/0x71
...
[ 1959.323277] RIP  [<ffffffffa040efba>] dm_exception_store_set_chunk_size+0x7a/0x110 [dm_snapshot]
[ 1959.333090]  RSP <ffff88032a957b30>
[ 1959.336978] CR2: 0000000000000098
[ 1959.344121] ---[ end trace b049991ccad1169e ]---

Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1195899
Signed-off-by: Ding Xiang <dingxiang@huawei.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-04-12 09:08:39 -07:00
Bjorn Helgaas
8cbac3c4f7 PCI: Disable IO/MEM decoding for devices with non-compliant BARs
commit b84106b4e2 upstream.

The PCI config header (first 64 bytes of each device's config space) is
defined by the PCI spec so generic software can identify the device and
manage its usage of I/O, memory, and IRQ resources.

Some non-spec-compliant devices put registers other than BARs where the
BARs should be.  When the PCI core sizes these "BARs", the reads and writes
it does may have unwanted side effects, and the "BAR" may appear to
describe non-sensical address space.

Add a flag bit to mark non-compliant devices so we don't touch their BARs.
Turn off IO/MEM decoding to prevent the devices from consuming address
space, since we can't read the BARs to find out what that address space
would be.

Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Tested-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-04-12 09:08:37 -07:00
Zhang Rui
d75936f3f9 Thermal: Ignore invalid trip points
commit 81ad4276b5 upstream.

In some cases, platform thermal driver may report invalid trip points,
thermal core should not take any action for these trip points.

This fixed a regression that bogus trip point starts to screw up thermal
control on some Lenovo laptops, after
commit bb431ba26c
Author: Zhang Rui <rui.zhang@intel.com>
Date:   Fri Oct 30 16:31:47 2015 +0800

    Thermal: initialize thermal zone device correctly

    After thermal zone device registered, as we have not read any
    temperature before, thus tz->temperature should not be 0,
    which actually means 0C, and thermal trend is not available.
    In this case, we need specially handling for the first
    thermal_zone_device_update().

    Both thermal core framework and step_wise governor is
    enhanced to handle this. And since the step_wise governor
    is the only one that uses trends, so it's the only thermal
    governor that needs to be updated.

    Tested-by: Manuel Krause <manuelkrause@netscape.net>
    Tested-by: szegad <szegadlo@poczta.onet.pl>
    Tested-by: prash <prash.n.rao@gmail.com>
    Tested-by: amish <ammdispose-arch@yahoo.com>
    Tested-by: Matthias <morpheusxyz123@yahoo.de>
    Reviewed-by: Javi Merino <javi.merino@arm.com>
    Signed-off-by: Zhang Rui <rui.zhang@intel.com>
    Signed-off-by: Chen Yu <yu.c.chen@intel.com>

Link: https://bugzilla.redhat.com/show_bug.cgi?id=1317190
Link: https://bugzilla.kernel.org/show_bug.cgi?id=114551
Signed-off-by: Zhang Rui <rui.zhang@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-04-12 09:08:35 -07:00
Arnd Bergmann
e08f9a7c0e ASoC: samsung: pass DMA channels as pointers
commit b9a1a74381 upstream.

ARM64 allmodconfig produces a bunch of warnings when building the
samsung ASoC code:

sound/soc/samsung/dmaengine.c: In function 'samsung_asoc_init_dma_data':
sound/soc/samsung/dmaengine.c:53:32: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
   playback_data->filter_data = (void *)playback->channel;
sound/soc/samsung/dmaengine.c:60:31: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
   capture_data->filter_data = (void *)capture->channel;

We could easily shut up the warning by adding an intermediate cast,
but there is a bigger underlying problem: The use of IORESOURCE_DMA
to pass data from platform code to device drivers is dubious to start
with, as what we really want is a pointer that can be passed into
a filter function.

Note that on s3c64xx, the pl08x DMA data is already a pointer, but
gets cast to resource_size_t so we can pass it as a resource, and it
then gets converted back to a pointer. In contrast, the data we pass
for s3c24xx is an index into a device specific table, and we artificially
convert that into a pointer for the filter function.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Krzysztof Kozlowski <k.kozlowski@samsung.com>
Signed-off-by: Mark Brown <broonie@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-04-12 09:08:32 -07:00
Alex Shi
f96884662e Merge remote-tracking branch 'origin/v4.4/topic/OPPv2' into linux-linaro-lsk-v4.4 2016-04-08 13:59:30 +08:00
Viresh Kumar
6229dc0db2 PM / OPP: Add dev_pm_opp_set_rate()
This adds a routine, dev_pm_opp_set_rate(), responsible for configuring
power-supply and clock source for an OPP.

The OPP is found by matching against the target_freq passed to the
routine. This shall replace similar code present in most of the OPP
users and help simplify them a lot.

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Reviewed-by: Stephen Boyd <sboyd@codeaurora.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
(cherry picked from commit 6a0712f6f1)
Signed-off-by: Alex Shi <alex.shi@linaro.org>
2016-04-08 11:24:28 +08:00
Viresh Kumar
2f5f3fb48d PM / OPP: Introduce dev_pm_opp_get_max_transition_latency()
In few use cases (like: cpufreq), it is desired to get the maximum
latency for changing OPPs. Add support for that.

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Reviewed-by: Stephen Boyd <sboyd@codeaurora.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
(cherry picked from commit 2174344765)
Signed-off-by: Alex Shi <alex.shi@linaro.org>
2016-04-08 11:24:26 +08:00
Viresh Kumar
525680c1a9 PM / OPP: Introduce dev_pm_opp_get_max_volt_latency()
In few use cases (like: cpufreq), it is desired to get the maximum
voltage latency for changing OPPs. Add support for that.

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Reviewed-by: Stephen Boyd <sboyd@codeaurora.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
(cherry picked from commit 655c9df961)
Signed-off-by: Alex Shi <alex.shi@linaro.org>
2016-04-08 11:24:25 +08:00
Viresh Kumar
c070696dd6 PM / OPP: get/put regulators from OPP core
This allows the OPP core to request/free the regulator resource,
attached to a device OPP. The regulator device is fetched using the name
provided by the driver, while calling: dev_pm_opp_set_regulator().

This will work for both OPP-v1 and v2 bindings.

This is a preliminary step for moving the OPP switching logic into the
OPP core.

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Reviewed-by: Stephen Boyd <sboyd@codeaurora.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
(cherry picked from commit 9f8ea969d5)
Signed-off-by: Alex Shi <alex.shi@linaro.org>
2016-04-08 11:24:24 +08:00
Viresh Kumar
6a0f8e913d PM / OPP: Parse 'opp-<prop>-<name>' bindings
OPP bindings (for few properties) allow a platform to choose a
value/range among a set of available options. The options are present as
opp-<prop>-<name>, where the platform needs to supply the <name> string.

The OPP properties which allow such an option are: opp-microvolt and
opp-microamp.

Add support to the OPP-core to parse these bindings, by introducing
dev_pm_opp_{set|put}_prop_name() APIs.

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Tested-by: Lee Jones <lee.jones@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
(cherry picked from commit 01fb4d3c39)
Signed-off-by: Alex Shi <alex.shi@linaro.org>
2016-04-08 11:22:24 +08:00
Viresh Kumar
ae6a227802 PM / OPP: Parse 'opp-supported-hw' binding
OPP bindings allow a platform to enable OPPs based on the version of the
hardware they are used for.

Add support to the OPP-core to parse these bindings, by introducing
dev_pm_opp_{set|put}_supported_hw() APIs.

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Tested-by: Lee Jones <lee.jones@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
(cherry picked from commit 7de36b0aa5)
Signed-off-by: Alex Shi <alex.shi@linaro.org>
2016-04-08 11:22:24 +08:00
Mark Brown
917a9133a6 Merge remote-tracking branch 'lsk/linux-linaro-lsk-v4.4' into linux-linaro-lsk-v4.4 2016-03-18 09:48:58 +00:00
Mark Brown
ddbcfcba5f Merge tag 'v4.4.5' into linux-linaro-lsk-v4.4
This is the 4.4.5 stable release

# gpg: Signature made Wed 09 Mar 2016 23:36:03 GMT using RSA key ID 6092693E
# gpg: Good signature from "Greg Kroah-Hartman (Linux kernel stable release signing key) <greg@kroah.com>"
# gpg: WARNING: This key is not certified with a trusted signature!
# gpg:          There is no indication that the signature belongs to the owner.
# Primary key fingerprint: 647F 2865 4894 E3BD 4571  99BE 38DB BDC8 6092 693E
2016-03-18 09:45:54 +00:00
Mark Brown
dfabba9c37 Merge remote-tracking branch 'lsk/v4.4/topic/ro-vdso' into linux-linaro-lsk-v4.4 2016-03-17 18:54:33 +00:00
Kees Cook
7ece66e296 arch: Introduce post-init read-only memory
commit c74ba8b348 upstream.

One of the easiest ways to protect the kernel from attack is to reduce
the internal attack surface exposed when a "write" flaw is available. By
making as much of the kernel read-only as possible, we reduce the
attack surface.

Many things are written to only during __init, and never changed
again. These cannot be made "const" since the compiler will do the wrong
thing (we do actually need to write to them). Instead, move these items
into a memory region that will be made read-only during mark_rodata_ro()
which happens after all kernel __init code has finished.

This introduces __ro_after_init as a way to mark such memory, and adds
some documentation about the existing __read_mostly marking.

This improves the security of the Linux kernel by marking formerly
read-write memory regions as read-only on a fully booted up system.

Based on work by PaX Team and Brad Spengler.

Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brad Spengler <spender@grsecurity.net>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: David Brown <david.brown@linaro.org>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Emese Revfy <re.emese@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mathias Krause <minipli@googlemail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: PaX Team <pageexec@freemail.hu>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: kernel-hardening@lists.openwall.com
Cc: linux-arch <linux-arch@vger.kernel.org>
Link: http://lkml.kernel.org/r/1455748879-21872-5-git-send-email-keescook@chromium.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: David Brown <david.brown@linaro.org>
Signed-off-by: Mark Brown <broonie@kernel.org>
2016-03-17 18:51:51 +00:00
Kees Cook
52596251f6 asm-generic: Consolidate mark_rodata_ro()
commit e267d97b83 upstream.

Instead of defining mark_rodata_ro() in each architecture, consolidate it.

Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Will Deacon <will.deacon@arm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Gross <agross@codeaurora.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Ashok Kumar <ashoks@broadcom.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Borislav Petkov <bp@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Brown <david.brown@linaro.org>
Cc: David Hildenbrand <dahi@linux.vnet.ibm.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Emese Revfy <re.emese@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Helge Deller <deller@gmx.de>
Cc: James E.J. Bottomley <jejb@parisc-linux.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luis R. Rodriguez <mcgrof@suse.com>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathias Krause <minipli@googlemail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicolas Pitre <nicolas.pitre@linaro.org>
Cc: PaX Team <pageexec@freemail.hu>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Stephen Boyd <sboyd@codeaurora.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Toshi Kani <toshi.kani@hp.com>
Cc: kernel-hardening@lists.openwall.com
Cc: linux-arch <linux-arch@vger.kernel.org>
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-parisc@vger.kernel.org
Link: http://lkml.kernel.org/r/1455748879-21872-2-git-send-email-keescook@chromium.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: David Brown <david.brown@linaro.org>
Signed-off-by: Mark Brown <broonie@kernel.org>
2016-03-17 18:51:51 +00:00
Ming Lei
b59ea3efba block: don't optimize for non-cloned bio in bio_get_last_bvec()
commit 90d0f0f115 upstream.

For !BIO_CLONED bio, we can use .bi_vcnt safely, but it
doesn't mean we can just simply return .bi_io_vec[.bi_vcnt - 1]
because the start postion may have been moved in the middle of
the bvec, such as splitting in the middle of bvec.

Fixes: 7bcd79ac50d9(block: bio: introduce helpers to get the 1st and last bvec)
Reported-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-03-16 08:43:01 -07:00
Johannes Berg
bfed1f518d cfg80211/wext: fix message ordering
commit cb150b9d23 upstream.

Since cfg80211 frequently takes actions from its netdev notifier
call, wireless extensions messages could still be ordered badly
since the wext netdev notifier, since wext is built into the
kernel, runs before the cfg80211 netdev notifier. For example,
the following can happen:

5: wlan1: <BROADCAST,MULTICAST> mtu 1500 qdisc mq state DOWN group default
    link/ether 02:00:00:00:01:00 brd ff:ff:ff:ff:ff:ff
5: wlan1: <BROADCAST,MULTICAST,UP>
    link/ether

when setting the interface down causes the wext message.

To also fix this, export the wireless_nlevent_flush() function
and also call it from the cfg80211 notifier.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-03-16 08:42:59 -07:00
Steven Rostedt (Red Hat)
f3c83858c6 tracing: Fix check for cpu online when event is disabled
commit dc17147de3 upstream.

Commit f37755490f ("tracepoints: Do not trace when cpu is offline") added
a check to make sure that tracepoints only get called when the cpu is
online, as it uses rcu_read_lock_sched() for protection.

Commit 3a630178fd ("tracing: generate RCU warnings even when tracepoints
are disabled") added lockdep checks (including rcu checks) for events that
are not enabled to catch possible RCU issues that would only be triggered if
a trace event was enabled. Commit f37755490f only stopped the warnings
when the trace event was enabled but did not prevent warnings if the trace
event was called when disabled.

To fix this, the cpu online check is moved to where the condition is added
to the trace event. This will place the cpu online check in all places that
it may be used now and in the future.

Fixes: f37755490f ("tracepoints: Do not trace when cpu is offline")
Fixes: 3a630178fd ("tracing: generate RCU warnings even when tracepoints are disabled")
Reported-by: Sudeep Holla <sudeep.holla@arm.com>
Tested-by: Sudeep Holla <sudeep.holla@arm.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-03-16 08:42:58 -07:00
Rusty Russell
610dde5afb modules: fix longstanding /proc/kallsyms vs module insertion race.
commit 8244062ef1 upstream.

For CONFIG_KALLSYMS, we keep two symbol tables and two string tables.
There's one full copy, marked SHF_ALLOC and laid out at the end of the
module's init section.  There's also a cut-down version that only
contains core symbols and strings, and lives in the module's core
section.

After module init (and before we free the module memory), we switch
the mod->symtab, mod->num_symtab and mod->strtab to point to the core
versions.  We do this under the module_mutex.

However, kallsyms doesn't take the module_mutex: it uses
preempt_disable() and rcu tricks to walk through the modules, because
it's used in the oops path.  It's also used in /proc/kallsyms.
There's nothing atomic about the change of these variables, so we can
get the old (larger!) num_symtab and the new symtab pointer; in fact
this is what I saw when trying to reproduce.

By grouping these variables together, we can use a
carefully-dereferenced pointer to ensure we always get one or the
other (the free of the module init section is already done in an RCU
callback, so that's safe).  We allocate the init one at the end of the
module init section, and keep the core one inside the struct module
itself (it could also have been allocated at the end of the module
core, but that's probably overkill).

[ Rebased for 4.4-stable and older, because the following changes aren't
  in the older trees:
  - e022441851: adds arg to is_core_symbol
  - 7523e4dc50: module_init/module_core/init_size/core_size
    become init_layout.base/core_layout.base/init_layout.size/core_layout.size.
]

Reported-by: Weilong Chen <chenweilong@huawei.com>
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=111541
Cc: stable@kernel.org
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-03-09 15:34:56 -08:00
Ming Lei
f4f0cca3c1 block: get the 1st and last bvec via helpers
commit 25e71a99f1 upstream.

This patch applies the two introduced helpers to
figure out the 1st and last bvec, and fixes the
original way after bio splitting.

Reported-by: Sagi Grimberg <sagig@dev.mellanox.co.il>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-03-09 15:34:56 -08:00
Ming Lei
0528bdbc44 block: check virt boundary in bio_will_gap()
commit e0af29171a upstream.

In the following patch, the way for figuring out
the last bvec will be changed with a bit cost introduced,
so return immediately if the queue doesn't have virt
boundary limit. Actually most of devices have not
this limit.

Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-03-09 15:34:56 -08:00
Steven Rostedt (Red Hat)
180c86a4f0 tracing: Do not have 'comm' filter override event 'comm' field
commit e57cbaf0eb upstream.

Commit 9f61668073 "tracing: Allow triggers to filter for CPU ids and
process names" added a 'comm' filter that will filter events based on the
current tasks struct 'comm'. But this now hides the ability to filter events
that have a 'comm' field too. For example, sched_migrate_task trace event.
That has a 'comm' field of the task to be migrated.

 echo 'comm == "bash"' > events/sched_migrate_task/filter

will now filter all sched_migrate_task events for tasks named "bash" that
migrates other tasks (in interrupt context), instead of seeing when "bash"
itself gets migrated.

This fix requires a couple of changes.

1) Change the look up order for filter predicates to look at the events
   fields before looking at the generic filters.

2) Instead of basing the filter function off of the "comm" name, have the
   generic "comm" filter have its own filter_type (FILTER_COMM). Test
   against the type instead of the name to assign the filter function.

3) Add a new "COMM" filter that works just like "comm" but will filter based
   on the current task, even if the trace event contains a "comm" field.

Do the same for "cpu" field, adding a FILTER_CPU and a filter "CPU".

Fixes: 9f61668073 "tracing: Allow triggers to filter for CPU ids and process names"
Reported-by: Matt Fleming <matt@codeblueprint.co.uk>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-03-09 15:34:52 -08:00
Tejun Heo
c5cbbec54f writeback: flush inode cgroup wb switches instead of pinning super_block
commit a1a0e23e49 upstream.

If cgroup writeback is in use, inodes can be scheduled for
asynchronous wb switching.  Before 5ff8eaac16 ("writeback: keep
superblock pinned during cgroup writeback association switches"), this
could race with umount leading to super_block being destroyed while
inodes are pinned for wb switching.  5ff8eaac16 fixed it by bumping
s_active while wb switches are in flight; however, this allowed
in-flight wb switches to make umounts asynchronous when the userland
expected synchronosity - e.g. fsck immediately following umount may
fail because the device is still busy.

This patch removes the problematic super_block pinning and instead
makes generic_shutdown_super() flush in-flight wb switches.  wb
switches are now executed on a dedicated isw_wq so that they can be
flushed and isw_nr_in_flight keeps track of the number of in-flight wb
switches so that flushing can be avoided in most cases.

v2: Move cgroup_writeback_umount() further below and add MS_ACTIVE
    check in inode_switch_wbs() as Jan an Al suggested.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Tahsin Erdogan <tahsin@google.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Link: http://lkml.kernel.org/g/CAAeU0aNCq7LGODvVGRU-oU_o-6enii5ey0p1c26D1ZzYwkDc5A@mail.gmail.com
Fixes: 5ff8eaac16 ("writeback: keep superblock pinned during cgroup writeback association switches")
Reviewed-by: Jan Kara <jack@suse.cz>
Tested-by: Tahsin Erdogan <tahsin@google.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-03-09 15:34:52 -08:00
Ming Lei
7adb5cc0f3 block: bio: introduce helpers to get the 1st and last bvec
commit 7bcd79ac50 upstream.

The bio passed to bio_will_gap() may be fast cloned from upper
layer(dm, md, bcache, fs, ...), or from bio splitting in block
core.

Unfortunately bio_will_gap() just figures out the last bvec via
'bi_io_vec[prev->bi_vcnt - 1]' directly, and this way is obviously
wrong.

This patch introduces two helpers for getting the first and last
bvec of one bio for fixing the issue.

Reported-by: Sagi Grimberg <sagig@dev.mellanox.co.il>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-03-09 15:34:52 -08:00
Harvey Hunt
cea2cbff57 libata: Align ata_device's id on a cacheline
commit 4ee34ea3a1 upstream.

The id buffer in ata_device is a DMA target, but it isn't explicitly
cacheline aligned. Due to this, adjacent fields can be overwritten with
stale data from memory on non coherent architectures. As a result, the
kernel is sometimes unable to communicate with an ATA device.

Fix this by ensuring that the id buffer is cacheline aligned.

This issue is similar to that fixed by Commit 84bda12af3
("libata: align ap->sector_buf").

Signed-off-by: Harvey Hunt <harvey.hunt@imgtec.com>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-03-09 15:34:52 -08:00
Arnd Bergmann
b693f2ad0f libata: fix HDIO_GET_32BIT ioctl
commit 287e6611ab upstream.

As reported by Soohoon Lee, the HDIO_GET_32BIT ioctl does not
work correctly in compat mode with libata.

I have investigated the issue further and found multiple problems
that all appeared with the same commit that originally introduced
HDIO_GET_32BIT handling in libata back in linux-2.6.8 and presumably
also linux-2.4, as the code uses "copy_to_user(arg, &val, 1)" to copy
a 'long' variable containing either 0 or 1 to user space.

The problems with this are:

* On big-endian machines, this will always write a zero because it
  stores the wrong byte into user space.

* In compat mode, the upper three bytes of the variable are updated
  by the compat_hdio_ioctl() function, but they now contain
  uninitialized stack data.

* The hdparm tool calling this ioctl uses a 'static long' variable
  to store the result. This means at least the upper bytes are
  initialized to zero, but calling another ioctl like HDIO_GET_MULTCOUNT
  would fill them with data that remains stale when the low byte
  is overwritten. Fortunately libata doesn't implement any of the
  affected ioctl commands, so this would only happen when we query
  both an IDE and an ATA device in the same command such as
  "hdparm -N -c /dev/hda /dev/sda"

* The libata code for unknown reasons started using ATA_IOC_GET_IO32
  and ATA_IOC_SET_IO32 as aliases for HDIO_GET_32BIT and HDIO_SET_32BIT,
  while the ioctl commands that were added later use the normal
  HDIO_* names. This is harmless but rather confusing.

This addresses all four issues by changing the code to use put_user()
on an 'unsigned long' variable in HDIO_GET_32BIT, like the IDE subsystem
does, and by clarifying the names of the ioctl commands.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reported-by: Soohoon Lee <Soohoon.Lee@f5.com>
Tested-by: Soohoon Lee <Soohoon.Lee@f5.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-03-09 15:34:52 -08:00
Mike Christie
3028963a28 target: Fix WRITE_SAME/DISCARD conversion to linux 512b sectors
commit 8a9ebe717a upstream.

In a couple places we are not converting to/from the Linux
block layer 512 bytes sectors.

1.

The request queue values and what we do are a mismatch of
things:

max_discard_sectors - This is in linux block layer 512 byte
sectors. We are just copying this to max_unmap_lba_count.

discard_granularity - This is in bytes. We are converting it
to Linux block layer 512 byte sectors.

discard_alignment - This is in bytes. We are just copying
this over.

The problem is that the core LIO code exports these values in
spc_emulate_evpd_b0 and we use them to test request arguments
in sbc_execute_unmap, but we never convert to the block size
we export to the initiator. If we are not using 512 byte sectors
then we are exporting the wrong values or are checks are off.
And, for the discard_alignment/bytes case we are just plain messed
up.

2.

blkdev_issue_discard's start and number of sector arguments
are supposed to be in linux block layer 512 byte sectors. We are
currently passing in the values we get from the initiator which
might be based on some other sector size.

There is a similar problem in iblock_execute_write_same where
the bio functions want values in 512 byte sectors but we are
passing in what we got from the initiator.

Signed-off-by: Mike Christie <mchristi@redhat.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
[ kamal: backport to 4.4-stable: no unmap_zeroes_data ]
Signed-off-by: Kamal Mostafa <kamal@canonical.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-03-09 15:34:51 -08:00
Al Viro
53729dbbd2 use ->d_seq to get coherency between ->d_inode and ->d_flags
commit a528aca7f3 upstream.

Games with ordering and barriers are way too brittle.  Just
bump ->d_seq before and after updating ->d_inode and ->d_flags
type bits, so that verifying ->d_seq would guarantee they are
coherent.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-03-09 15:34:49 -08:00
Christoph Hellwig
fd921e5756 nfs: fix nfs_size_to_loff_t
commit 50ab8ec74a upstream.

See http: //www.infradead.org/rpr.html
X-Evolution-Source: 1451162204.2173.11@leira.trondhjem.org
Content-Transfer-Encoding: 8bit
Mime-Version: 1.0

We support OFFSET_MAX just fine, so don't round down below it.  Also
switch to using min_t to make the helper more readable.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Fixes: 433c92379d ("NFS: Clean up nfs_size_to_loff_t()")
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-03-03 15:07:28 -08:00
Tejun Heo
4cbd196324 cgroup: make sure a parent css isn't offlined before its children
commit aa226ff4a1 upstream.

There are three subsystem callbacks in css shutdown path -
css_offline(), css_released() and css_free().  Except for
css_released(), cgroup core didn't guarantee the order of invocation.
css_offline() or css_free() could be called on a parent css before its
children.  This behavior is unexpected and led to bugs in cpu and
memory controller.

This patch updates offline path so that a parent css is never offlined
before its children.  Each css keeps online_cnt which reaches zero iff
itself and all its children are offline and offline_css() is invoked
only after online_cnt reaches zero.

This fixes the memory controller bug and allows the fix for cpu
controller.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-and-tested-by: Christian Borntraeger <borntraeger@de.ibm.com>
Reported-by: Brian Christiansen <brian.o.christiansen@gmail.com>
Link: http://lkml.kernel.org/g/5698A023.9070703@de.ibm.com
Link: http://lkml.kernel.org/g/CAKB58ikDkzc8REt31WBkD99+hxNzjK4+FBmhkgS+NVrC9vjMSg@mail.gmail.com
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-03-03 15:07:28 -08:00
Tejun Heo
fff4dc84e7 cpuset: make mm migration asynchronous
commit e93ad19d05 upstream.

If "cpuset.memory_migrate" is set, when a process is moved from one
cpuset to another with a different memory node mask, pages in used by
the process are migrated to the new set of nodes.  This was performed
synchronously in the ->attach() callback, which is synchronized
against process management.  Recently, the synchronization was changed
from per-process rwsem to global percpu rwsem for simplicity and
optimization.

Combined with the synchronous mm migration, this led to deadlocks
because mm migration could schedule a work item which may in turn try
to create a new worker blocking on the process management lock held
from cgroup process migration path.

This heavy an operation shouldn't be performed synchronously from that
deep inside cgroup migration in the first place.  This patch punts the
actual migration to an ordered workqueue and updates cgroup process
migration and cpuset config update paths to flush the workqueue after
all locks are released.  This way, the operations still seem
synchronous to userland without entangling mm migration with process
management synchronization.  CPU hotplug can also invoke mm migration
but there's no reason for it to wait for mm migrations and thus
doesn't synchronize against their completions.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-and-tested-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-03-03 15:07:28 -08:00
Nicholas Bellinger
fb6a326e30 target: Fix remote-port TMR ABORT + se_cmd fabric stop
commit 0f4a943168 upstream.

To address the bug where fabric driver level shutdown
of se_cmd occurs at the same time when TMR CMD_T_ABORTED
is happening resulting in a -1 ->cmd_kref, this patch
adds a CMD_T_FABRIC_STOP bit that is used to determine
when TMR + driver I_T nexus shutdown is happening
concurrently.

It changes target_sess_cmd_list_set_waiting() to obtain
se_cmd->cmd_kref + set CMD_T_FABRIC_STOP, and drop local
reference in target_wait_for_sess_cmds() and invoke extra
target_put_sess_cmd() during Task Aborted Status (TAS)
when necessary.

Also, it adds a new target_wait_free_cmd() wrapper around
transport_wait_for_tasks() for the special case within
transport_generic_free_cmd() to set CMD_T_FABRIC_STOP,
and is now aware of CMD_T_ABORTED + CMD_T_TAS status
bits to know when an extra transport_put_cmd() during
TAS is required.

Note transport_generic_free_cmd() is expected to block on
cmd->cmd_wait_comp in order to follow what iscsi-target
expects during iscsi_conn context se_cmd shutdown.

Cc: Quinn Tran <quinn.tran@qlogic.com>
Cc: Himanshu Madhani <himanshu.madhani@qlogic.com>
Cc: Sagi Grimberg <sagig@mellanox.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Andy Grover <agrover@redhat.com>
Cc: Mike Christie <mchristi@redhat.com>
Signed-off-by: Nicholas Bellinger <nab@daterainc.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-03-03 15:07:27 -08:00