Commit Graph

64498 Commits

Author SHA1 Message Date
Andy Lutomirski
5bacea89dc fs,userns: Change inode_capable to capable_wrt_inode_uidgid
commit 23adbe12ef upstream.

The kernel has no concept of capabilities with respect to inodes; inodes
exist independently of namespaces.  For example, inode_capable(inode,
CAP_LINUX_IMMUTABLE) would be nonsense.

This patch changes inode_capable to check for uid and gid mappings and
renames it to capable_wrt_inode_uidgid, which should make it more
obvious what it does.

Fixes CVE-2014-4014.

Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Serge Hallyn <serge.hallyn@ubuntu.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Chinner <david@fromorbit.com>
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-06-16 13:40:32 -07:00
Sebastian Ott
1b96b1053c percpu-refcount: fix usage of this_cpu_ops
commit 0c36b390a5 upstream.

The percpu-refcount infrastructure uses the underscore variants of
this_cpu_ops in order to modify percpu reference counters.
(e.g. __this_cpu_inc()).

However the underscore variants do not atomically update the percpu
variable, instead they may be implemented using read-modify-write
semantics (more than one instruction).  Therefore it is only safe to
use the underscore variant if the context is always the same (process,
softirq, or hardirq). Otherwise it is possible to lose updates.

This problem is something that Sebastian has seen within the aio
subsystem which uses percpu refcounters both in process and softirq
context leading to reference counts that never dropped to zeroes; even
though the number of "get" and "put" calls matched.

Fix this by using the non-underscore this_cpu_ops variant which
provides correct per cpu atomic semantics and fixes the corrupted
reference counts.

Cc: Kent Overstreet <kmo@daterainc.com>
Reported-by: Sebastian Ott <sebott@linux.vnet.ibm.com>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
References: http://lkml.kernel.org/g/alpine.LFD.2.11.1406041540520.21183@denkbrett
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-06-11 11:54:13 -07:00
Greg Kroah-Hartman
852b4ede30 USB: cdc-wdm: properly include types.h
commit 7ac3764fca upstream.

The file include/uapi/linux/usb/cdc-wdm.h uses a __u16 so it needs to
include types.h as well to make the build system happy.

Fixes: 3edce1cf81 ("USB: cdc-wdm: implement IOCTL_WDM_MAX_COMMAND")
Cc: Bjørn Mork <bjorn@mork.no>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-06-11 11:54:12 -07:00
Bjørn Mork
f814958eaf usb: cdc-wdm: export cdc-wdm uapi header
commit 7d1896360f upstream.

The include/uapi/linux/usb/cdc-wdm.h header defines cdc-wdm
userspace APIs and should be exported by make headers_install.

Fixes: 3edce1cf81 ("USB: cdc-wdm: implement IOCTL_WDM_MAX_COMMAND")
Signed-off-by: Bjørn Mork <bjorn@mork.no>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-06-11 11:54:12 -07:00
Xuelin Shi
e411566e7d dmaengine: fix dmaengine_unmap failure
commit c1f43dd9c2 upstream.

The count which is used to get_unmap_data maybe not the same as the
count computed in dmaengine_unmap which causes to free data in a
wrong pool.

This patch fixes this issue by keeping the map count with unmap_data
structure and use this count to get the pool.

Signed-off-by: Xuelin Shi <xuelin.shi@freescale.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-06-07 10:28:25 -07:00
Arnd Bergmann
bae3622bb8 genirq: Provide irq_force_affinity fallback for non-SMP
commit 4c88d7f9b0 upstream.

Patch 01f8fa4f01 "genirq: Allow forcing cpu affinity of interrupts" added
an irq_force_affinity() function, and 30ccf03b4a "clocksource: Exynos_mct:
Use irq_force_affinity() in cpu bringup" subsequently uses it. However, the
driver can be used with CONFIG_SMP disabled, but the function declaration
is only available for CONFIG_SMP, leading to this build error:

drivers/clocksource/exynos_mct.c:431:3: error: implicit declaration of function 'irq_force_affinity' [-Werror=implicit-function-declaration]
   irq_force_affinity(mct_irqs[MCT_L0_IRQ + cpu], cpumask_of(cpu));

This patch introduces a dummy helper function for the non-SMP case
that always returns success, to get rid of the build error.
Since the patches causing the problem are marked for stable backports,
this one should be as well.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Krzysztof Kozlowski <k.kozlowski@samsung.com>
Acked-by: Kukjin Kim <kgene.kim@samsung.com>
Link: http://lkml.kernel.org/r/5619084.0zmrrIUZLV@wuerfel
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-06-07 10:28:24 -07:00
Romain Izard
df02df0a92 trace: module: Maintain a valid user count
commit 098507ae3e upstream.

The replacement of the 'count' variable by two variables 'incs' and
'decs' to resolve some race conditions during module unloading was done
in parallel with some cleanup in the trace subsystem, and was integrated
as a merge.

Unfortunately, the formula for this replacement was wrong in the tracing
code, and the refcount in the traces was not usable as a result.

Use 'count = incs - decs' to compute the user count.

Link: http://lkml.kernel.org/p/1393924179-9147-1-git-send-email-romain.izard.pro@gmail.com

Acked-by: Ingo Molnar <mingo@kernel.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Fixes: c1ab9cab75 "merge conflict resolution"
Signed-off-by: Romain Izard <romain.izard.pro@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-06-07 10:28:20 -07:00
Hans de Goede
46ce727cb6 Input: Add INPUT_PROP_TOPBUTTONPAD device property
commit f37c013409 upstream.

On some newer laptops with a trackpoint the physical buttons for the
trackpoint have been removed to allow for a larger touchpad. On these
laptops the buttonpad has clearly marked areas on the top which are to be
used as trackpad buttons.

Users of the event device-node need to know about this, so that they can
properly interpret BTN_LEFT events as being a left / right / middle click
depending on where on the button pad the clicking finger is.

This commits adds a INPUT_PROP_TOPBUTTONPAD device property which drivers
for such buttonpads will use to signal to the user that this buttonpad not
only has the normal bottom button area, but also a top button area.

Signed-off-by: Hans de Goede <hdegoede@redhat.com>
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-06-07 10:28:20 -07:00
Hans de Goede
bca4529f14 Input: serio - add firmware_id sysfs attribute
commit 0456c66f4e upstream.

serio devices exposed via platform firmware interfaces such as ACPI may
provide additional identifying information of use to userspace.

We don't associate the serio devices with the firmware device (we don't
set it as parent), so there's no way for userspace to make use of this
information.

We cannot change the parent for serio devices instantiated though a
firmware interface as that would break suspend / resume ordering.

Therefore this patch adds a new firmware_id sysfs attribute so that
userspace can get a string from there with any additional identifying
information the firmware interface may provide.

Signed-off-by: Hans de Goede <hdegoede@redhat.com>
Acked-by: Peter Hutterer <peter.hutterer@who-t.net>
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-06-07 10:28:19 -07:00
Thierry Reding
ab19485551 drm/tegra: Remove gratuitous pad field
commit cbfbbabb89 upstream.

The version of the drm_tegra_submit structure that was merged all the
way back in 3.10 contains a pad field that was originally intended to
properly pad the following __u64 field. Unfortunately it seems like a
different field was dropped during review that caused this padding to
become unnecessary, but the pad field wasn't removed at that time.

One possible side-effect of this is that since the __u64 following the
pad is now no longer properly aligned, the compiler may (or may not)
introduce padding itself, which results in no predictable ABI.

Rectify this by removing the pad field so that all fields are again
naturally aligned. Technically this is breaking existing userspace ABI,
but given that there aren't any (released) userspace drivers that make
use of this yet, the fallout should be minimal.

Fixes: d43f81cbaf ("drm/tegra: Add gr2d device")
Signed-off-by: Thierry Reding <treding@nvidia.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-06-07 10:28:15 -07:00
Eliad Peller
853487c377 cfg80211: add cfg80211_sched_scan_stopped_rtnl
commit 792e6aa7a1 upstream.

Add locked-version for cfg80211_sched_scan_stopped.
This is used for some users that might want to
call it when rtnl is already locked.

Fixes: d43c6b6 ("mac80211: reschedule sched scan after HW restart")
Signed-off-by: Eliad Peller <eliadx.peller@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-06-07 10:28:10 -07:00
Thomas Gleixner
8f14108281 genirq: Allow forcing cpu affinity of interrupts
commit 01f8fa4f01 upstream.

The current implementation of irq_set_affinity() refuses rightfully to
route an interrupt to an offline cpu.

But there is a special case, where this is actually desired. Some of
the ARM SoCs have per cpu timers which require setting the affinity
during cpu startup where the cpu is not yet in the online mask.

If we can't do that, then the local timer interrupt for the about to
become online cpu is routed to some random online cpu.

The developers of the affected machines tried to work around that
issue, but that results in a massive mess in that timer code.

We have a yet unused argument in the set_affinity callbacks of the irq
chips, which I added back then for a similar reason. It was never
required so it got not used. But I'm happy that I never removed it.

That allows us to implement a sane handling of the above scenario. So
the affected SoC drivers can add the required force handling to their
interrupt chip, switch the timer code to irq_force_affinity() and
things just work.

This does not affect any existing user of irq_set_affinity().

Tagged for stable to allow a simple fix of the affected SoC clock
event drivers.

Reported-and-tested-by: Krzysztof Kozlowski <k.kozlowski@samsung.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Kyungmin Park <kyungmin.park@samsung.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Cc: Tomasz Figa <t.figa@samsung.com>,
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>,
Cc: Kukjin Kim <kgene.kim@samsung.com>
Cc: linux-arm-kernel@lists.infradead.org,
Link: http://lkml.kernel.org/r/20140416143315.717251504@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-06-07 10:28:08 -07:00
Rob Herring
274eb7bf1d of/irq: do irq resolution in platform_get_irq
commit 9ec36cafe4 upstream.

Currently we get the following kind of errors if we try to use interrupt
phandles to irqchips that have not yet initialized:

irq: no irq domain found for /ocp/pinmux@48002030 !
------------[ cut here ]------------
WARNING: CPU: 0 PID: 1 at drivers/of/platform.c:171 of_device_alloc+0x144/0x184()
Modules linked in:
CPU: 0 PID: 1 Comm: swapper/0 Not tainted 3.12.0-00038-g42a9708 #1012
(show_stack+0x14/0x1c)
(dump_stack+0x6c/0xa0)
(warn_slowpath_common+0x64/0x84)
(warn_slowpath_null+0x1c/0x24)
(of_device_alloc+0x144/0x184)
(of_platform_device_create_pdata+0x44/0x9c)
(of_platform_bus_create+0xd0/0x170)
(of_platform_bus_create+0x12c/0x170)
(of_platform_populate+0x60/0x98)

This is because we're wrongly trying to populate resources that are not
yet available. It's perfectly valid to create irqchips dynamically, so
let's fix up the issue by resolving the interrupt resources when
platform_get_irq is called.

And then we also need to accept the fact that some irqdomains do not
exist that early on, and only get initialized later on. So we can
make the current WARN_ON into just into a pr_debug().

We still attempt to populate irq resources when we create the devices.
This allows current drivers which don't use platform_get_irq to continue
to function. Once all drivers are fixed, this code can be removed.

Suggested-by: Russell King <linux@arm.linux.org.uk>
Signed-off-by: Rob Herring <robh@kernel.org>
Signed-off-by: Tony Lindgren <tony@atomide.com>
Tested-by: Tony Lindgren <tony@atomide.com>
Signed-off-by: Grant Likely <grant.likely@linaro.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-06-07 10:28:07 -07:00
Steven Rostedt (Red Hat)
d9550cf732 ftrace/module: Hardcode ftrace_module_init() call into load_module()
commit a949ae560a upstream.

A race exists between module loading and enabling of function tracer.

	CPU 1				CPU 2
	-----				-----
  load_module()
   module->state = MODULE_STATE_COMING

				register_ftrace_function()
				 mutex_lock(&ftrace_lock);
				 ftrace_startup()
				  update_ftrace_function();
				   ftrace_arch_code_modify_prepare()
				    set_all_module_text_rw();
				   <enables-ftrace>
				    ftrace_arch_code_modify_post_process()
				     set_all_module_text_ro();

				[ here all module text is set to RO,
				  including the module that is
				  loading!! ]

   blocking_notifier_call_chain(MODULE_STATE_COMING);
    ftrace_init_module()

     [ tries to modify code, but it's RO, and fails!
       ftrace_bug() is called]

When this race happens, ftrace_bug() will produces a nasty warning and
all of the function tracing features will be disabled until reboot.

The simple solution is to treate module load the same way the core
kernel is treated at boot. To hardcode the ftrace function modification
of converting calls to mcount into nops. This is done in init/main.c
there's no reason it could not be done in load_module(). This gives
a better control of the changes and doesn't tie the state of the
module to its notifiers as much. Ftrace is special, it needs to be
treated as such.

The reason this would work, is that the ftrace_module_init() would be
called while the module is in MODULE_STATE_UNFORMED, which is ignored
by the set_all_module_text_ro() call.

Link: http://lkml.kernel.org/r/1395637826-3312-1-git-send-email-indou.takao@jp.fujitsu.com

Reported-by: Takao Indoh <indou.takao@jp.fujitsu.com>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-06-07 10:28:07 -07:00
Cong Wang
af8c0e0612 rtnetlink: wait for unregistering devices in rtnl_link_unregister()
[ Upstream commit 200b916f35 ]

From: Cong Wang <cwang@twopensource.com>

commit 50624c934d (net: Delay default_device_exit_batch until no
devices are unregistering) introduced rtnl_lock_unregistering() for
default_device_exit_batch(). Same race could happen we when rmmod a driver
which calls rtnl_link_unregister() as we call dev->destructor without rtnl
lock.

For long term, I think we should clean up the mess of netdev_run_todo()
and net namespce exit code.

Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Cong Wang <cwang@twopensource.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-05-31 13:20:38 -07:00
Hannes Frederic Sowa
adeb3fe4ef net: avoid dependency of net_get_random_once on nop patching
[ Upstream commit 3d4405226d ]

net_get_random_once depends on the static keys infrastructure to patch up
the branch to the slow path during boot. This was realized by abusing the
static keys api and defining a new initializer to not enable the call
site while still indicating that the branch point should get patched
up. This was needed to have the fast path considered likely by gcc.

The static key initialization during boot up normally walks through all
the registered keys and either patches in ideal nops or enables the jump
site but omitted that step on x86 if ideal nops where already placed at
static_key branch points. Thus net_get_random_once branches not always
became active.

This patch switches net_get_random_once to the ordinary static_key
api and thus places the kernel fast path in the - by gcc considered -
unlikely path.  Microbenchmarks on Intel and AMD x86-64 showed that
the unlikely path actually beats the likely path in terms of cycle cost
and that different nop patterns did not make much difference, thus this
switch should not be noticeable.

Fixes: a48e42920f ("net: introduce new macro net_get_random_once")
Reported-by: Tuomas Räsänen <tuomasjjrasanen@tjjr.fi>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-05-31 13:20:38 -07:00
Andy King
1b3ac8488e vsock: Make transport the proto owner
[ Upstream commit 2c4a336e0a ]

Right now the core vsock module is the owner of the proto family. This
means there's nothing preventing the transport module from unloading if
there are open sockets, which results in a panic. Fix that by allowing
the transport to be the owner, which will refcount it properly.

Includes version bump to 1.0.1.0-k

Passes checkpatch this time, I swear...

Acked-by: Dmitry Torokhov <dtor@vmware.com>
Signed-off-by: Andy King <acking@vmware.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-05-31 13:20:36 -07:00
Andrew Lutomirski
265bcb0ea1 net: Fix ns_capable check in sock_diag_put_filterinfo
[ Upstream commit 78541c1dc6 ]

The caller needs capabilities on the namespace being queried, not on
their own namespace.  This is a security bug, although it likely has
only a minor impact.

Cc: stable@vger.kernel.org
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-05-31 13:20:34 -07:00
Vlad Yasevich
3938b0336a net: sctp: cache auth_enable per endpoint
[ Upstream commit b14878ccb7 ]

Currently, it is possible to create an SCTP socket, then switch
auth_enable via sysctl setting to 1 and crash the system on connect:

Oops[#1]:
CPU: 0 PID: 0 Comm: swapper Not tainted 3.14.1-mipsgit-20140415 #1
task: ffffffff8056ce80 ti: ffffffff8055c000 task.ti: ffffffff8055c000
[...]
Call Trace:
[<ffffffff8043c4e8>] sctp_auth_asoc_set_default_hmac+0x68/0x80
[<ffffffff8042b300>] sctp_process_init+0x5e0/0x8a4
[<ffffffff8042188c>] sctp_sf_do_5_1B_init+0x234/0x34c
[<ffffffff804228c8>] sctp_do_sm+0xb4/0x1e8
[<ffffffff80425a08>] sctp_endpoint_bh_rcv+0x1c4/0x214
[<ffffffff8043af68>] sctp_rcv+0x588/0x630
[<ffffffff8043e8e8>] sctp6_rcv+0x10/0x24
[<ffffffff803acb50>] ip6_input+0x2c0/0x440
[<ffffffff8030fc00>] __netif_receive_skb_core+0x4a8/0x564
[<ffffffff80310650>] process_backlog+0xb4/0x18c
[<ffffffff80313cbc>] net_rx_action+0x12c/0x210
[<ffffffff80034254>] __do_softirq+0x17c/0x2ac
[<ffffffff800345e0>] irq_exit+0x54/0xb0
[<ffffffff800075a4>] ret_from_irq+0x0/0x4
[<ffffffff800090ec>] rm7k_wait_irqoff+0x24/0x48
[<ffffffff8005e388>] cpu_startup_entry+0xc0/0x148
[<ffffffff805a88b0>] start_kernel+0x37c/0x398
Code: dd0900b8  000330f8  0126302d <dcc60000> 50c0fff1  0047182a  a48306a0
03e00008  00000000
---[ end trace b530b0551467f2fd ]---
Kernel panic - not syncing: Fatal exception in interrupt

What happens while auth_enable=0 in that case is, that
ep->auth_hmacs is initialized to NULL in sctp_auth_init_hmacs()
when endpoint is being created.

After that point, if an admin switches over to auth_enable=1,
the machine can crash due to NULL pointer dereference during
reception of an INIT chunk. When we enter sctp_process_init()
via sctp_sf_do_5_1B_init() in order to respond to an INIT chunk,
the INIT verification succeeds and while we walk and process
all INIT params via sctp_process_param() we find that
net->sctp.auth_enable is set, therefore do not fall through,
but invoke sctp_auth_asoc_set_default_hmac() instead, and thus,
dereference what we have set to NULL during endpoint
initialization phase.

The fix is to make auth_enable immutable by caching its value
during endpoint initialization, so that its original value is
being carried along until destruction. The bug seems to originate
from the very first days.

Fix in joint work with Daniel Borkmann.

Reported-by: Joshua Kinard <kumba@gentoo.org>
Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Tested-by: Joshua Kinard <kumba@gentoo.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-05-31 13:20:34 -07:00
Vlad Yasevich
1c030bf16f macvlan: Fix lockdep warnings with stacked macvlan devices
[ Upstream commit c674ac30c5 ]

Macvlan devices try to avoid stacking, but that's not always
successfull or even desired.  As an example, the following
configuration is perefectly legal and valid:

eth0 <--- macvlan0 <---- vlan0.10 <--- macvlan1

However, this configuration produces the following lockdep
trace:
[  115.620418] ======================================================
[  115.620477] [ INFO: possible circular locking dependency detected ]
[  115.620516] 3.15.0-rc1+ #24 Not tainted
[  115.620540] -------------------------------------------------------
[  115.620577] ip/1704 is trying to acquire lock:
[  115.620604]  (&vlan_netdev_addr_lock_key/1){+.....}, at: [<ffffffff815df49c>] dev_uc_sync+0x3c/0x80
[  115.620686]
but task is already holding lock:
[  115.620723]  (&macvlan_netdev_addr_lock_key){+.....}, at: [<ffffffff815da5be>] dev_set_rx_mode+0x1e/0x40
[  115.620795]
which lock already depends on the new lock.

[  115.620853]
the existing dependency chain (in reverse order) is:
[  115.620894]
-> #1 (&macvlan_netdev_addr_lock_key){+.....}:
[  115.620935]        [<ffffffff810d57f2>] lock_acquire+0xa2/0x130
[  115.620974]        [<ffffffff816f62e7>] _raw_spin_lock_nested+0x37/0x50
[  115.621019]        [<ffffffffa07296c3>] vlan_dev_set_rx_mode+0x53/0x110 [8021q]
[  115.621066]        [<ffffffff815da557>] __dev_set_rx_mode+0x57/0xa0
[  115.621105]        [<ffffffff815da5c6>] dev_set_rx_mode+0x26/0x40
[  115.621143]        [<ffffffff815da6be>] __dev_open+0xde/0x140
[  115.621174]        [<ffffffff815da9ad>] __dev_change_flags+0x9d/0x170
[  115.621174]        [<ffffffff815daaa9>] dev_change_flags+0x29/0x60
[  115.621174]        [<ffffffff815e7f11>] do_setlink+0x321/0x9a0
[  115.621174]        [<ffffffff815ea59f>] rtnl_newlink+0x51f/0x730
[  115.621174]        [<ffffffff815e6e75>] rtnetlink_rcv_msg+0x95/0x250
[  115.621174]        [<ffffffff81608b19>] netlink_rcv_skb+0xa9/0xc0
[  115.621174]        [<ffffffff815e6dca>] rtnetlink_rcv+0x2a/0x40
[  115.621174]        [<ffffffff81608150>] netlink_unicast+0xf0/0x1c0
[  115.621174]        [<ffffffff8160851f>] netlink_sendmsg+0x2ff/0x740
[  115.621174]        [<ffffffff815bc9db>] sock_sendmsg+0x8b/0xc0
[  115.621174]        [<ffffffff815bd4b9>] ___sys_sendmsg+0x369/0x380
[  115.621174]        [<ffffffff815bdbb2>] __sys_sendmsg+0x42/0x80
[  115.621174]        [<ffffffff815bdc02>] SyS_sendmsg+0x12/0x20
[  115.621174]        [<ffffffff816ffd69>] system_call_fastpath+0x16/0x1b
[  115.621174]
-> #0 (&vlan_netdev_addr_lock_key/1){+.....}:
[  115.621174]        [<ffffffff810d4d43>] __lock_acquire+0x1773/0x1a60
[  115.621174]        [<ffffffff810d57f2>] lock_acquire+0xa2/0x130
[  115.621174]        [<ffffffff816f62e7>] _raw_spin_lock_nested+0x37/0x50
[  115.621174]        [<ffffffff815df49c>] dev_uc_sync+0x3c/0x80
[  115.621174]        [<ffffffffa0696d2a>] macvlan_set_mac_lists+0xca/0x110 [macvlan]
[  115.621174]        [<ffffffff815da557>] __dev_set_rx_mode+0x57/0xa0
[  115.621174]        [<ffffffff815da5c6>] dev_set_rx_mode+0x26/0x40
[  115.621174]        [<ffffffff815da6be>] __dev_open+0xde/0x140
[  115.621174]        [<ffffffff815da9ad>] __dev_change_flags+0x9d/0x170
[  115.621174]        [<ffffffff815daaa9>] dev_change_flags+0x29/0x60
[  115.621174]        [<ffffffff815e7f11>] do_setlink+0x321/0x9a0
[  115.621174]        [<ffffffff815ea59f>] rtnl_newlink+0x51f/0x730
[  115.621174]        [<ffffffff815e6e75>] rtnetlink_rcv_msg+0x95/0x250
[  115.621174]        [<ffffffff81608b19>] netlink_rcv_skb+0xa9/0xc0
[  115.621174]        [<ffffffff815e6dca>] rtnetlink_rcv+0x2a/0x40
[  115.621174]        [<ffffffff81608150>] netlink_unicast+0xf0/0x1c0
[  115.621174]        [<ffffffff8160851f>] netlink_sendmsg+0x2ff/0x740
[  115.621174]        [<ffffffff815bc9db>] sock_sendmsg+0x8b/0xc0
[  115.621174]        [<ffffffff815bd4b9>] ___sys_sendmsg+0x369/0x380
[  115.621174]        [<ffffffff815bdbb2>] __sys_sendmsg+0x42/0x80
[  115.621174]        [<ffffffff815bdc02>] SyS_sendmsg+0x12/0x20
[  115.621174]        [<ffffffff816ffd69>] system_call_fastpath+0x16/0x1b
[  115.621174]
other info that might help us debug this:

[  115.621174]  Possible unsafe locking scenario:

[  115.621174]        CPU0                    CPU1
[  115.621174]        ----                    ----
[  115.621174]   lock(&macvlan_netdev_addr_lock_key);
[  115.621174]                                lock(&vlan_netdev_addr_lock_key/1);
[  115.621174]                                lock(&macvlan_netdev_addr_lock_key);
[  115.621174]   lock(&vlan_netdev_addr_lock_key/1);
[  115.621174]
 *** DEADLOCK ***

[  115.621174] 2 locks held by ip/1704:
[  115.621174]  #0:  (rtnl_mutex){+.+.+.}, at: [<ffffffff815e6dbb>] rtnetlink_rcv+0x1b/0x40
[  115.621174]  #1:  (&macvlan_netdev_addr_lock_key){+.....}, at: [<ffffffff815da5be>] dev_set_rx_mode+0x1e/0x40
[  115.621174]
stack backtrace:
[  115.621174] CPU: 3 PID: 1704 Comm: ip Not tainted 3.15.0-rc1+ #24
[  115.621174] Hardware name: Hewlett-Packard HP xw8400 Workstation/0A08h, BIOS 786D5 v02.38 10/25/2010
[  115.621174]  ffffffff82339ae0 ffff880465f79568 ffffffff816ee20c ffffffff82339ae0
[  115.621174]  ffff880465f795a8 ffffffff816e9e1b ffff880465f79600 ffff880465b019c8
[  115.621174]  0000000000000001 0000000000000002 ffff880465b019c8 ffff880465b01230
[  115.621174] Call Trace:
[  115.621174]  [<ffffffff816ee20c>] dump_stack+0x4d/0x66
[  115.621174]  [<ffffffff816e9e1b>] print_circular_bug+0x200/0x20e
[  115.621174]  [<ffffffff810d4d43>] __lock_acquire+0x1773/0x1a60
[  115.621174]  [<ffffffff810d3172>] ? trace_hardirqs_on_caller+0xb2/0x1d0
[  115.621174]  [<ffffffff810d57f2>] lock_acquire+0xa2/0x130
[  115.621174]  [<ffffffff815df49c>] ? dev_uc_sync+0x3c/0x80
[  115.621174]  [<ffffffff816f62e7>] _raw_spin_lock_nested+0x37/0x50
[  115.621174]  [<ffffffff815df49c>] ? dev_uc_sync+0x3c/0x80
[  115.621174]  [<ffffffff815df49c>] dev_uc_sync+0x3c/0x80
[  115.621174]  [<ffffffffa0696d2a>] macvlan_set_mac_lists+0xca/0x110 [macvlan]
[  115.621174]  [<ffffffff815da557>] __dev_set_rx_mode+0x57/0xa0
[  115.621174]  [<ffffffff815da5c6>] dev_set_rx_mode+0x26/0x40
[  115.621174]  [<ffffffff815da6be>] __dev_open+0xde/0x140
[  115.621174]  [<ffffffff815da9ad>] __dev_change_flags+0x9d/0x170
[  115.621174]  [<ffffffff815daaa9>] dev_change_flags+0x29/0x60
[  115.621174]  [<ffffffff811e1db1>] ? mem_cgroup_bad_page_check+0x21/0x30
[  115.621174]  [<ffffffff815e7f11>] do_setlink+0x321/0x9a0
[  115.621174]  [<ffffffff810d394c>] ? __lock_acquire+0x37c/0x1a60
[  115.621174]  [<ffffffff815ea59f>] rtnl_newlink+0x51f/0x730
[  115.621174]  [<ffffffff815ea169>] ? rtnl_newlink+0xe9/0x730
[  115.621174]  [<ffffffff815e6e75>] rtnetlink_rcv_msg+0x95/0x250
[  115.621174]  [<ffffffff810d329d>] ? trace_hardirqs_on+0xd/0x10
[  115.621174]  [<ffffffff815e6dbb>] ? rtnetlink_rcv+0x1b/0x40
[  115.621174]  [<ffffffff815e6de0>] ? rtnetlink_rcv+0x40/0x40
[  115.621174]  [<ffffffff81608b19>] netlink_rcv_skb+0xa9/0xc0
[  115.621174]  [<ffffffff815e6dca>] rtnetlink_rcv+0x2a/0x40
[  115.621174]  [<ffffffff81608150>] netlink_unicast+0xf0/0x1c0
[  115.621174]  [<ffffffff8160851f>] netlink_sendmsg+0x2ff/0x740
[  115.621174]  [<ffffffff815bc9db>] sock_sendmsg+0x8b/0xc0
[  115.621174]  [<ffffffff8119d4af>] ? might_fault+0x5f/0xb0
[  115.621174]  [<ffffffff8119d4f8>] ? might_fault+0xa8/0xb0
[  115.621174]  [<ffffffff8119d4af>] ? might_fault+0x5f/0xb0
[  115.621174]  [<ffffffff815cb51e>] ? verify_iovec+0x5e/0xe0
[  115.621174]  [<ffffffff815bd4b9>] ___sys_sendmsg+0x369/0x380
[  115.621174]  [<ffffffff816faa0d>] ? __do_page_fault+0x11d/0x570
[  115.621174]  [<ffffffff810cfe9f>] ? up_read+0x1f/0x40
[  115.621174]  [<ffffffff816fab04>] ? __do_page_fault+0x214/0x570
[  115.621174]  [<ffffffff8120a10b>] ? mntput_no_expire+0x6b/0x1c0
[  115.621174]  [<ffffffff8120a0b7>] ? mntput_no_expire+0x17/0x1c0
[  115.621174]  [<ffffffff8120a284>] ? mntput+0x24/0x40
[  115.621174]  [<ffffffff815bdbb2>] __sys_sendmsg+0x42/0x80
[  115.621174]  [<ffffffff815bdc02>] SyS_sendmsg+0x12/0x20
[  115.621174]  [<ffffffff816ffd69>] system_call_fastpath+0x16/0x1b

Fix this by correctly providing macvlan lockdep class.

Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-05-31 13:20:34 -07:00
Vlad Yasevich
102e103f76 vlan: Fix lockdep warning with stacked vlan devices.
[ Upstream commit d38569ab2b ]

This reverts commit dc8eaaa006.
	vlan: Fix lockdep warning when vlan dev handle notification

Instead we use the new new API to find the lock subclass of
our vlan device.  This way we can support configurations where
vlans are interspersed with other devices:
  bond -> vlan -> macvlan -> vlan

Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-05-31 13:20:34 -07:00
Vlad Yasevich
d0166f814a net: Allow for more then a single subclass for netif_addr_lock
[ Upstream commit 25175ba5c9 ]

Currently netif_addr_lock_nested assumes that there can be only
a single nesting level between 2 devices.  However, if we
have multiple devices of the same type stacked, this fails.
For example:
 eth0 <-- vlan0.10 <-- vlan0.10.20

A more complicated configuration may stack more then one type of
device in different order.
Ex:
  eth0 <-- vlan0.10 <-- macvlan0 <-- vlan1.10.20 <-- macvlan1

This patch adds an ndo_* function that allows each stackable
device to report its nesting level.  If the device doesn't
provide this function default subclass of 1 is used.

Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-05-31 13:20:34 -07:00
Vlad Yasevich
69ab2a8b80 net: Find the nesting level of a given device by type.
[ Upstream commit 4085ebe8c3 ]

Multiple devices in the kernel can be stacked/nested and they
need to know their nesting level for the purposes of lockdep.
This patch provides a generic function that determines a nesting
level of a particular device by its type (ex: vlan, macvlan, etc).
We only care about nesting of the same type of devices.

For example:
  eth0 <- vlan0.10 <- macvlan0 <- vlan1.20

The nesting level of vlan1.20 would be 1, since there is another vlan
in the stack under it.

Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-05-31 13:20:34 -07:00
Daniel Borkmann
bde6d78b4a Revert "net: sctp: Fix a_rwnd/rwnd management to reflect real state of the receiver's buffer"
[ Upstream commit 362d52040c ]

This reverts commit ef2820a735 ("net: sctp: Fix a_rwnd/rwnd management
to reflect real state of the receiver's buffer") as it introduced a
serious performance regression on SCTP over IPv4 and IPv6, though a not
as dramatic on the latter. Measurements are on 10Gbit/s with ixgbe NICs.

Current state:

[root@Lab200slot2 ~]# iperf3 --sctp -4 -c 192.168.241.3 -V -l 1452 -t 60
iperf version 3.0.1 (10 January 2014)
Linux Lab200slot2 3.14.0 #1 SMP Thu Apr 3 23:18:29 EDT 2014 x86_64
Time: Fri, 11 Apr 2014 17:56:21 GMT
Connecting to host 192.168.241.3, port 5201
      Cookie: Lab200slot2.1397238981.812898.548918
[  4] local 192.168.241.2 port 38616 connected to 192.168.241.3 port 5201
Starting Test: protocol: SCTP, 1 streams, 1452 byte blocks, omitting 0 seconds, 60 second test
[ ID] Interval           Transfer     Bandwidth
[  4]   0.00-1.09   sec  20.8 MBytes   161 Mbits/sec
[  4]   1.09-2.13   sec  10.8 MBytes  86.8 Mbits/sec
[  4]   2.13-3.15   sec  3.57 MBytes  29.5 Mbits/sec
[  4]   3.15-4.16   sec  4.33 MBytes  35.7 Mbits/sec
[  4]   4.16-6.21   sec  10.4 MBytes  42.7 Mbits/sec
[  4]   6.21-6.21   sec  0.00 Bytes    0.00 bits/sec
[  4]   6.21-7.35   sec  34.6 MBytes   253 Mbits/sec
[  4]   7.35-11.45  sec  22.0 MBytes  45.0 Mbits/sec
[  4]  11.45-11.45  sec  0.00 Bytes    0.00 bits/sec
[  4]  11.45-11.45  sec  0.00 Bytes    0.00 bits/sec
[  4]  11.45-11.45  sec  0.00 Bytes    0.00 bits/sec
[  4]  11.45-12.51  sec  16.0 MBytes   126 Mbits/sec
[  4]  12.51-13.59  sec  20.3 MBytes   158 Mbits/sec
[  4]  13.59-14.65  sec  13.4 MBytes   107 Mbits/sec
[  4]  14.65-16.79  sec  33.3 MBytes   130 Mbits/sec
[  4]  16.79-16.79  sec  0.00 Bytes    0.00 bits/sec
[  4]  16.79-17.82  sec  5.94 MBytes  48.7 Mbits/sec
(etc)

[root@Lab200slot2 ~]#  iperf3 --sctp -6 -c 2001:db8:0:f101::1 -V -l 1400 -t 60
iperf version 3.0.1 (10 January 2014)
Linux Lab200slot2 3.14.0 #1 SMP Thu Apr 3 23:18:29 EDT 2014 x86_64
Time: Fri, 11 Apr 2014 19:08:41 GMT
Connecting to host 2001:db8:0:f101::1, port 5201
      Cookie: Lab200slot2.1397243321.714295.2b3f7c
[  4] local 2001:db8:0:f101::2 port 55804 connected to 2001:db8:0:f101::1 port 5201
Starting Test: protocol: SCTP, 1 streams, 1400 byte blocks, omitting 0 seconds, 60 second test
[ ID] Interval           Transfer     Bandwidth
[  4]   0.00-1.00   sec   169 MBytes  1.42 Gbits/sec
[  4]   1.00-2.00   sec   201 MBytes  1.69 Gbits/sec
[  4]   2.00-3.00   sec   188 MBytes  1.58 Gbits/sec
[  4]   3.00-4.00   sec   174 MBytes  1.46 Gbits/sec
[  4]   4.00-5.00   sec   165 MBytes  1.39 Gbits/sec
[  4]   5.00-6.00   sec   199 MBytes  1.67 Gbits/sec
[  4]   6.00-7.00   sec   163 MBytes  1.36 Gbits/sec
[  4]   7.00-8.00   sec   174 MBytes  1.46 Gbits/sec
[  4]   8.00-9.00   sec   193 MBytes  1.62 Gbits/sec
[  4]   9.00-10.00  sec   196 MBytes  1.65 Gbits/sec
[  4]  10.00-11.00  sec   157 MBytes  1.31 Gbits/sec
[  4]  11.00-12.00  sec   175 MBytes  1.47 Gbits/sec
[  4]  12.00-13.00  sec   192 MBytes  1.61 Gbits/sec
[  4]  13.00-14.00  sec   199 MBytes  1.67 Gbits/sec
(etc)

After patch:

[root@Lab200slot2 ~]#  iperf3 --sctp -4 -c 192.168.240.3 -V -l 1452 -t 60
iperf version 3.0.1 (10 January 2014)
Linux Lab200slot2 3.14.0+ #1 SMP Mon Apr 14 12:06:40 EDT 2014 x86_64
Time: Mon, 14 Apr 2014 16:40:48 GMT
Connecting to host 192.168.240.3, port 5201
      Cookie: Lab200slot2.1397493648.413274.65e131
[  4] local 192.168.240.2 port 50548 connected to 192.168.240.3 port 5201
Starting Test: protocol: SCTP, 1 streams, 1452 byte blocks, omitting 0 seconds, 60 second test
[ ID] Interval           Transfer     Bandwidth
[  4]   0.00-1.00   sec   240 MBytes  2.02 Gbits/sec
[  4]   1.00-2.00   sec   239 MBytes  2.01 Gbits/sec
[  4]   2.00-3.00   sec   240 MBytes  2.01 Gbits/sec
[  4]   3.00-4.00   sec   239 MBytes  2.00 Gbits/sec
[  4]   4.00-5.00   sec   245 MBytes  2.05 Gbits/sec
[  4]   5.00-6.00   sec   240 MBytes  2.01 Gbits/sec
[  4]   6.00-7.00   sec   240 MBytes  2.02 Gbits/sec
[  4]   7.00-8.00   sec   239 MBytes  2.01 Gbits/sec

With the reverted patch applied, the SCTP/IPv4 performance is back
to normal on latest upstream for IPv4 and IPv6 and has same throughput
as 3.4.2 test kernel, steady and interval reports are smooth again.

Fixes: ef2820a735 ("net: sctp: Fix a_rwnd/rwnd management to reflect real state of the receiver's buffer")
Reported-by: Peter Butler <pbutler@sonusnet.com>
Reported-by: Dongsheng Song <dongsheng.song@gmail.com>
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Tested-by: Peter Butler <pbutler@sonusnet.com>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Cc: Matija Glavinic Pecotic <matija.glavinic-pecotic.ext@nsn.com>
Cc: Alexander Sverdlin <alexander.sverdlin@nsn.com>
Cc: Vlad Yasevich <vyasevich@gmail.com>
Acked-by: Vlad Yasevich <vyasevich@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-05-31 13:20:33 -07:00
Eric Dumazet
6c09ba984c ipv6: Limit mtu to 65575 bytes
[ Upstream commit 30f78d8ebf ]

Francois reported that setting big mtu on loopback device could prevent
tcp sessions making progress.

We do not support (yet ?) IPv6 Jumbograms and cook corrupted packets.

We must limit the IPv6 MTU to (65535 + 40) bytes in theory.

Tested:

ifconfig lo mtu 70000
netperf -H ::1

Before patch : Throughput :   0.05 Mbits

After patch : Throughput : 35484 Mbits

Reported-by: Francois WELLENREITER <f.wellenreiter@gmail.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-05-31 13:20:33 -07:00
Mel Gorman
402e194dfc mm: use paravirt friendly ops for NUMA hinting ptes
commit 29c7787075 upstream.

David Vrabel identified a regression when using automatic NUMA balancing
under Xen whereby page table entries were getting corrupted due to the
use of native PTE operations.  Quoting him

	Xen PV guest page tables require that their entries use machine
	addresses if the preset bit (_PAGE_PRESENT) is set, and (for
	successful migration) non-present PTEs must use pseudo-physical
	addresses.  This is because on migration MFNs in present PTEs are
	translated to PFNs (canonicalised) so they may be translated back
	to the new MFN in the destination domain (uncanonicalised).

	pte_mknonnuma(), pmd_mknonnuma(), pte_mknuma() and pmd_mknuma()
	set and clear the _PAGE_PRESENT bit using pte_set_flags(),
	pte_clear_flags(), etc.

	In a Xen PV guest, these functions must translate MFNs to PFNs
	when clearing _PAGE_PRESENT and translate PFNs to MFNs when setting
	_PAGE_PRESENT.

His suggested fix converted p[te|md]_[set|clear]_flags to using
paravirt-friendly ops but this is overkill.  He suggested an alternative
of using p[te|md]_modify in the NUMA page table operations but this is
does more work than necessary and would require looking up a VMA for
protections.

This patch modifies the NUMA page table operations to use paravirt
friendly operations to set/clear the flags of interest.  Unfortunately
this will take a performance hit when updating the PTEs on
CONFIG_PARAVIRT but I do not see a way around it that does not break
Xen.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: David Vrabel <david.vrabel@citrix.com>
Tested-by: David Vrabel <david.vrabel@citrix.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Anvin <hpa@zytor.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Steven Noonan <steven@uplinklabs.net>
Cc: Rik van Riel <riel@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Cyrill Gorcunov <gorcunov@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-05-31 13:20:30 -07:00
Peter Zijlstra
4c03d46991 x86,preempt: Fix preemption for i386
Many people reported preemption/reschedule problems with i386 kernels
for .13 and .14. After Michele bisected this to a combination of

  3e8e42c69bb ("sched: Revert need_resched() to look at TIF_NEED_RESCHED")
  ded7975475 ("irq: Force hardirq exit's softirq processing on its own stack")

it finally dawned on me that i386's current_thread_info() was to
blame.

When we are on interrupt/exception stacks, we fail to observe the
right TIF_NEED_RESCHED bit and therefore the PREEMPT_NEED_RESCHED
folding malfunctions.

Current upstream fixes this by making i386 behave the same as x86_64
already did:

  2432e1364b ("x86: Nuke the supervisor_stack field in i386 thread_info")
  b807902a88 ("x86: Nuke GET_THREAD_INFO_WITH_ESP() macro for i386")
  0788aa6a23 ("x86: Prepare removal of previous_esp from i386 thread_info structure")
  198d208df4 ("x86: Keep thread_info on thread stack in x86_32")

However, that is far too much to stuff into -stable. Therefore I
propose we merge the below patch which uses task_thread_info(current)
for tif_need_resched() instead of the ESP based current_thread_info().

This makes sure we always observe the one true TIF_NEED_RESCHED bit
and things will work as expected again.

Cc: bp@alien8.de
Cc: fweisbec@gmail.com
Cc: david.a.cohen@linux.intel.com
Cc: mingo@kernel.org
Cc: fweisbec@gmail.com
Cc: greg@kroah.com
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: gregkh@linuxfoundation.org
Cc: pbonzini@redhat.com
Cc: rostedt@goodmis.org
Cc: stefan.bader@canonical.com
Cc: mingo@kernel.org
Cc: toralf.foerster@gmx.de
Cc: David Cohen <david.a.cohen@linux.intel.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: torvalds@linux-foundation.org
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: David Cohen <david.a.cohen@linux.intel.com>
Cc: <stable@vger.kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: <stable-commits@vger.kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: peterz@infradead.org
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: barra_cuda@katamail.com
Tested-by: Stefan Bader <stefan.bader@canonical.com>
Tested-by: Toralf F¿rster <toralf.foerster@gmx.de>
Tested-by: Michele Ballabio <barra_cuda@katamail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20140409142447.GD13658@twins.programming.kicks-ass.net
2014-05-31 13:20:28 -07:00
Roman Pen
0a8eda9c00 blktrace: fix accounting of partially completed requests
commit af5040da01 upstream.

trace_block_rq_complete does not take into account that request can
be partially completed, so we can get the following incorrect output
of blkparser:

  C   R 232 + 240 [0]
  C   R 240 + 232 [0]
  C   R 248 + 224 [0]
  C   R 256 + 216 [0]

but should be:

  C   R 232 + 8 [0]
  C   R 240 + 8 [0]
  C   R 248 + 8 [0]
  C   R 256 + 8 [0]

Also, the whole output summary statistics of completed requests and
final throughput will be incorrect.

This patch takes into account real completion size of the request and
fixes wrong completion accounting.

Signed-off-by: Roman Pen <r.peniaev@gmail.com>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: Frederic Weisbecker <fweisbec@gmail.com>
CC: Ingo Molnar <mingo@redhat.com>
CC: linux-kernel@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-05-31 13:20:28 -07:00
Andrey Vagin
7a6f558b35 netfilter: nf_conntrack: reserve two bytes for nf_ct_ext->len
commit 223b02d923 upstream.

"len" contains sizeof(nf_ct_ext) and size of extensions. In a worst
case it can contain all extensions. Bellow you can find sizes for all
types of extensions. Their sum is definitely bigger than 256.

nf_ct_ext_types[0]->len = 24
nf_ct_ext_types[1]->len = 32
nf_ct_ext_types[2]->len = 24
nf_ct_ext_types[3]->len = 32
nf_ct_ext_types[4]->len = 152
nf_ct_ext_types[5]->len = 2
nf_ct_ext_types[6]->len = 16
nf_ct_ext_types[7]->len = 8

I have seen "len" up to 280 and my host has crashes w/o this patch.

The right way to fix this problem is reducing the size of the ecache
extension (4) and Florian is going to do this, but these changes will
be quite large to be appropriate for a stable tree.

Fixes: 5b423f6a40 (netfilter: nf_conntrack: fix racy timer handling with reliable)
Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Cc: Patrick McHardy <kaber@trash.net>
Cc: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-05-31 13:20:27 -07:00
Patrick McHardy
d6421db1db netfilter: nf_tables: fix nft_cmp_fast failure on big endian for size < 4
commit b855d416dc upstream.

nft_cmp_fast is used for equality comparisions of size <= 4. For
comparisions of size < 4 byte a mask is calculated that is applied to
both the data from userspace (during initialization) and the register
value (during runtime). Both values are stored using (in effect) memcpy
to a memory area that is then interpreted as u32 by nft_cmp_fast.

This works fine on little endian since smaller types have the same base
address, however on big endian this is not true and the smaller types
are interpreted as a big number with trailing zero bytes.

The mask therefore must not include the lower bytes, but the higher bytes
on big endian. Add a helper function that does a cpu_to_le32 to switch
the bytes on big endian. Since we're dealing with a mask of just consequitive
bits, this works out fine.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-05-31 13:20:27 -07:00
Richard Guy Briggs
6960029959 pid: get pid_t ppid of task in init_pid_ns
commit ad36d28293 upstream.

Added the functions task_ppid_nr_ns() and task_ppid_nr() to abstract the lookup
of the PPID (real_parent's pid_t) of a process, including rcu locking, in the
arbitrary and init_pid_ns.
This provides an alternative to sys_getppid(), which is relative to the child
process' pid namespace.

(informed by ebiederman's 6c621b7e)
Cc: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-05-31 13:20:27 -07:00
James Bottomley
bf1802de75 scsi: fix our current target reap infrastructure
commit e63ed0d7a9 upstream.

This patch eliminates the reap_ref and replaces it with a proper kref.
On last put of this kref, the target is removed from visibility in
sysfs.  The final call to scsi_target_reap() for the device is done from
__scsi_remove_device() and only if the device was made visible.  This
ensures that the target disappears as soon as the last device is gone
rather than waiting until final release of the device (which is often
too long).

Reviewed-by: Alan Stern <stern@rowland.harvard.edu>
Tested-by: Sarah Sharp <sarah.a.sharp@linux.intel.com>
Signed-off-by: James Bottomley <JBottomley@Parallels.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-05-31 13:20:27 -07:00
Dan Williams
5c29506449 libata/ahci: accommodate tag ordered controllers
commit 8a4aeec8d2 upstream.

The AHCI spec allows implementations to issue commands in tag order
rather than FIFO order:

	5.3.2.12 P:SelectCmd
	HBA sets pSlotLoc = (pSlotLoc + 1) mod (CAP.NCS + 1)
	or HBA selects the command to issue that has had the
	PxCI bit set to '1' longer than any other command
	pending to be issued.

The result is that commands posted sequentially (time-wise) may play out
of sequence when issued by hardware.

This behavior has likely been hidden by drives that arrange for commands
to complete in issue order.  However, it appears recent drives (two from
different vendors that we have found so far) inflict out-of-order
completions as a matter of course.  So, we need to take care to maintain
ordered submission, otherwise we risk triggering a drive to fall out of
sequential-io automation and back to random-io processing, which incurs
large latency and degrades throughput.

This issue was found in simple benchmarks where QD=2 seq-write
performance was 30-50% *greater* than QD=32 seq-write performance.

Tagging for -stable and making the change globally since it has a low
risk-to-reward ratio.  Also, word is that recent versions of an unnamed
OS also does it this way now.  So, drives in the field are already
experienced with this tag ordering scheme.

Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Ed Ciechanowski <ed.ciechanowski@intel.com>
Reviewed-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-05-13 13:32:51 +02:00
Stephen Warren
73be2e28cf dt: tegra: remove non-existent clock IDs
commit 9ef1af9ea2 upstream.

The Tegra124 clock DT binding currently provides 3 clocks that don't
actually exist; 2 for NAND and one for UART5/UARTE. Delete these. While
this is technically an incompatible DT ABI change, nothing could have
used these clock IDs for anything practical, since the HW doesn't exist.

Signed-off-by: Stephen Warren <swarren@nvidia.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-05-06 07:59:38 -07:00
Martin K. Petersen
b4140e2b76 block: Fix for_each_bvec()
commit b7aa84d9cb upstream.

Commit 4550dd6c6b introduced for_each_bvec() which iterates over each
bvec attached to a bio or bip. However, the macro fails to check bi_size
before dereferencing which can lead to crashes while counting/mapping
integrity scatterlist segments.

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Kent Overstreet <kmo@daterainc.com>
Cc: Nicholas Bellinger <nab@linux-iscsi.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-05-06 07:59:36 -07:00
Al Viro
fc7b1646bf smarter propagate_mnt()
commit f2ebb3a921 upstream.

The current mainline has copies propagated to *all* nodes, then
tears down the copies we made for nodes that do not contain
counterparts of the desired mountpoint.  That sets the right
propagation graph for the copies (at teardown time we move
the slaves of removed node to a surviving peer or directly
to master), but we end up paying a fairly steep price in
useless allocations.  It's fairly easy to create a situation
where N calls of mount(2) create exactly N bindings, with
O(N^2) vfsmounts allocated and freed in process.

Fortunately, it is possible to avoid those allocations/freeings.
The trick is to create copies in the right order and find which
one would've eventually become a master with the current algorithm.
It turns out to be possible in O(nodes getting propagation) time
and with no extra allocations at all.

One part is that we need to make sure that eventual master will be
created before its slaves, so we need to walk the propagation
tree in a different order - by peer groups.  And iterate through
the peers before dealing with the next group.

Another thing is finding the (earlier) copy that will be a master
of one we are about to create; to do that we are (temporary) marking
the masters of mountpoints we are attaching the copies to.

Either we are in a peer of the last mountpoint we'd dealt with,
or we have the following situation: we are attaching to mountpoint M,
the last copy S_0 had been attached to M_0 and there are sequences
S_0...S_n, M_0...M_n such that S_{i+1} is a master of S_{i},
S_{i} mounted on M{i} and we need to create a slave of the first S_{k}
such that M is getting propagation from M_{k}.  It means that the master
of M_{k} will be among the sequence of masters of M.  On the
other hand, the nearest marked node in that sequence will either
be the master of M_{k} or the master of M_{k-1} (the latter -
in the case if M_{k-1} is a slave of something M gets propagation
from, but in a wrong peer group).

So we go through the sequence of masters of M until we find
a marked one (P).  Let N be the one before it.  Then we go through
the sequence of masters of S_0 until we find one (say, S) mounted
on a node D that has P as master and check if D is a peer of N.
If it is, S will be the master of new copy, if not - the master of S
will be.

That's it for the hard part; the rest is fairly simple.  Iterator
is in next_group(), handling of one prospective mountpoint is
propagate_one().

It seems to survive all tests and gives a noticably better performance
than the current mainline for setups that are seriously using shared
subtrees.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-05-06 07:59:36 -07:00
Serge Hallyn
5361026441 xattr: guard against simultaneous glibc header inclusion
commit ea1a8217b0 upstream.

If the glibc xattr.h header is included after the uapi header,
compilation fails due to an enum re-using a #define from the uapi
header.

Protect against this by guarding the define and enum inclusions against
each other.

(See https://lists.debian.org/debian-glibc/2014/03/msg00029.html
and https://sourceware.org/glibc/wiki/Synchronizing_Headers
for more information.)

Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Allan McRae <allan@archlinux.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-05-06 07:59:35 -07:00
Hans Verkuil
96f4b20a17 media: videodev2.h: add parenthesis around macro arguments
commit aee786acfc upstream.

bt->width should be (bt)->width, and same for the other fields.

Signed-off-by: Hans Verkuil <hans.verkuil@cisco.com>
Signed-off-by: Mauro Carvalho Chehab <m.chehab@samsung.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-05-06 07:59:31 -07:00
Will Deacon
f697f4d7a7 word-at-a-time: avoid undefined behaviour in zero_bytemask macro
commit ec6931b281 upstream.

The asm-generic, big-endian version of zero_bytemask creates a mask of
bytes preceding the first zero-byte by left shifting ~0ul based on the
position of the first zero byte.

Unfortunately, if the first (top) byte is zero, the output of
prep_zero_mask has only the top bit set, resulting in undefined C
behaviour as we shift left by an amount equal to the width of the type.
As it happens, GCC doesn't manage to spot this through the call to fls(),
but the issue remains if architectures choose to implement their shift
instructions differently.

An example would be arch/arm/ (AArch32), where LSL Rd, Rn, #32 results
in Rd == 0x0, whilst on arch/arm64 (AArch64) LSL Xd, Xn, #64 results in
Xd == Xn.

Rather than check explicitly for the problematic shift, this patch adds
an extra shift by 1, replacing fls with __fls. Since zero_bytemask is
never called with a zero argument (has_zero() is used to check the data
first), we don't need to worry about calling __fls(0), which is
undefined.

Cc: Victor Kamensky <victor.kamensky@linaro.org>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-05-06 07:59:29 -07:00
K. Y. Srinivasan
e1339dc85f Drivers: hv: vmbus: Negotiate version 3.0 when running on ws2012r2 hosts
commit 03367ef5ea upstream.

Only ws2012r2 hosts support the ability to reconnect to the host on VMBUS. This functionality
is needed by kexec in Linux. To use this functionality we need to negotiate version 3.0 of the
VMBUS protocol.

Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-05-06 07:59:29 -07:00
Moni Shoua
5daac1bcbb IB/core: Don't resolve passive side RoCE L2 address in CMA REQ handler
commit b2853fd6c2 upstream.

The code that resolves the passive side source MAC within the rdma_cm
connection request handler was both redundant and buggy, so remove it.

It was redundant since later, when an RC QP is modified to RTR state,
the resolution will take place in the ib_core module.  It was buggy
because this callback also deals with UD SIDR exchange, for which we
incorrectly looked at the REQ member of the CM event and dereferenced
a random value.

Fixes: dd5f03beb4 ("IB/core: Ethernet L2 attributes in verbs/cm structures")
Signed-off-by: Moni Shoua <monis@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-05-06 07:59:28 -07:00
Stanislav Kinsbursky
34ef215eed nfsd: check passed socket's net matches NFSd superblock's one
commit 3064639423 upstream.

There could be a case, when NFSd file system is mounted in network, different
to socket's one, like below:

"ip netns exec" creates new network and mount namespace, which duplicates NFSd
mount point, created in init_net context. And thus NFS server stop in nested
network context leads to RPCBIND client destruction in init_net.
Then, on NFSd start in nested network context, rpc.nfsd process creates socket
in nested net and passes it into "write_ports", which leads to RPCBIND sockets
creation in init_net context because of the same reason (NFSd monut point was
created in init_net context). An attempt to register passed socket in nested
net leads to panic, because no RPCBIND client present in nexted network
namespace.

This patch add check that passed socket's net matches NFSd superblock's one.
And returns -EINVAL error to user psace otherwise.

v2: Put socket on exit.

Reported-by: Weng Meiling <wengmeiling.weng@huawei.com>
Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-05-06 07:59:28 -07:00
Jan Kara
1c23ab6f88 bdi: avoid oops on device removal
commit 5acda9d12d upstream.

After commit 839a8e8660 ("writeback: replace custom worker pool
implementation with unbound workqueue") when device is removed while we
are writing to it we crash in bdi_writeback_workfn() ->
set_worker_desc() because bdi->dev is NULL.

This can happen because even though bdi_unregister() cancels all pending
flushing work, nothing really prevents new ones from being queued from
balance_dirty_pages() or other places.

Fix the problem by clearing BDI_registered bit in bdi_unregister() and
checking it before scheduling of any flushing work.

Fixes: 839a8e8660

Reviewed-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Cc: Derek Basehore <dbasehore@chromium.org>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-04-26 17:19:05 -07:00
Peter Hurley
4f1f4df2c1 tty: Fix low_latency BUG
commit a9c3f68f3c upstream.

The user-settable knob, low_latency, has been the source of
several BUG reports which stem from flush_to_ldisc() running
in interrupt context. Since 3.12, which added several sleeping
locks (termios_rwsem and buf->lock) to the input processing path,
the frequency of these BUG reports has increased.

Note that changes in 3.12 did not introduce this regression;
sleeping locks were first added to the input processing path
with the removal of the BKL from N_TTY in commit
a88a69c912,
'n_tty: Fix loss of echoed characters and remove bkl from n_tty'
and later in commit 38db89799b,
'tty: throttling race fix'. Since those changes, executing
flush_to_ldisc() in interrupt_context (ie, low_latency set), is unsafe.

However, since most devices do not validate if the low_latency
setting is appropriate for the context (process or interrupt) in
which they receive data, some reports are due to misconfiguration.
Further, serial dma devices for which dma fails, resort to
interrupt receiving as a backup without resetting low_latency.

Historically, low_latency was used to force wake-up the reading
process rather than wait for the next scheduler tick. The
effect was to trim multiple milliseconds of latency from
when the process would receive new data.

Recent tests [1] have shown that the reading process now receives
data with only 10's of microseconds latency without low_latency set.

Remove the low_latency rx steering from tty_flip_buffer_push();
however, leave the knob as an optional hint to drivers that can
tune their rx fifos and such like. Cleanup stale code comments
regarding low_latency.

[1] https://lkml.org/lkml/2014/2/20/434

"Yay.. thats an annoying historical pain in the butt gone."
	-- Alan Cox

Reported-by: Beat Bolli <bbolli@ewanet.ch>
Reported-by: Pavel Roskin <proski@gnu.org>
Acked-by: David Sterba <dsterba@suse.cz>
Cc: Grant Edwards <grant.b.edwards@gmail.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Hal Murray <murray+fedora@ip-64-139-1-69.sjc.megapath.net>
Signed-off-by: Peter Hurley <peter@hurleysoftware.com>
Signed-off-by: Alan Cox <alan@linux.intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-04-26 17:19:04 -07:00
Heiko Carstens
d232ed0c0e futex: Allow architectures to skip futex_atomic_cmpxchg_inatomic() test
commit 03b8c7b623 upstream.

If an architecture has futex_atomic_cmpxchg_inatomic() implemented and there
is no runtime check necessary, allow to skip the test within futex_init().

This allows to get rid of some code which would always give the same result,
and also allows the compiler to optimize a couple of if statements away.

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Finn Thain <fthain@telegraphics.com.au>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Link: http://lkml.kernel.org/r/20140302120947.GA3641@osiris
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-04-14 06:50:05 -07:00
Theodore Ts'o
00a1a053eb ext4: atomically set inode->i_flags in ext4_set_inode_flags()
Use cmpxchg() to atomically set i_flags instead of clearing out the
S_IMMUTABLE, S_APPEND, etc. flags and then setting them from the
EXT4_IMMUTABLE_FL, EXT4_APPEND_FL flags, since this opens up a race
where an immutable file has the immutable flag cleared for a brief
window of time.

Reported-by: John Sullivan <jsrhbz@kanargh.force9.co.uk>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-03-30 17:02:06 -07:00
Vlad Yasevich
2adb956b08 vlan: Warn the user if lowerdev has bad vlan features.
Some drivers incorrectly assign vlan acceleration features to
vlan_features thus causing issues for Q-in-Q vlan configurations.
Warn the user of such cases.

Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-28 17:16:51 -04:00
Vlad Yasevich
53d6471cef net: Account for all vlan headers in skb_mac_gso_segment
skb_network_protocol() already accounts for multiple vlan
headers that may be present in the skb.  However, skb_mac_gso_segment()
doesn't know anything about it and assumes that skb->mac_len
is set correctly to skip all mac headers.  That may not
always be the case.  If we are simply forwarding the packet (via
bridge or macvtap), all vlan headers may not be accounted for.

A simple solution is to allow skb_network_protocol to return
the vlan depth it has calculated.  This way skb_mac_gso_segment
will correctly skip all mac headers.

Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-28 17:10:36 -04:00
Hannes Frederic Sowa
c15b1ccadb ipv6: move DAD and addrconf_verify processing to workqueue
addrconf_join_solict and addrconf_join_anycast may cause actions which
need rtnl locked, especially on first address creation.

A new DAD state is introduced which defers processing of the initial
DAD processing into a workqueue.

To get rtnl lock we need to push the code paths which depend on those
calls up to workqueues, specifically addrconf_verify and the DAD
processing.

(v2)
addrconf_dad_failure needs to be queued up to the workqueue, too. This
patch introduces a new DAD state and stop the DAD processing in the
workqueue (this is because of the possible ipv6_del_addr processing
which removes the solicited multicast address from the device).

addrconf_verify_lock is removed, too. After the transition it is not
needed any more.

As we are not processing in bottom half anymore we need to be a bit more
careful about disabling bottom half out when we lock spin_locks which are also
used in bh.

Relevant backtrace:
[  541.030090] RTNL: assertion failed at net/core/dev.c (4496)
[  541.031143] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G           O 3.10.33-1-amd64-vyatta #1
[  541.031145] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2007
[  541.031146]  ffffffff8148a9f0 000000000000002f ffffffff813c98c1 ffff88007c4451f8
[  541.031148]  0000000000000000 0000000000000000 ffffffff813d3540 ffff88007fc03d18
[  541.031150]  0000880000000006 ffff88007c445000 ffffffffa0194160 0000000000000000
[  541.031152] Call Trace:
[  541.031153]  <IRQ>  [<ffffffff8148a9f0>] ? dump_stack+0xd/0x17
[  541.031180]  [<ffffffff813c98c1>] ? __dev_set_promiscuity+0x101/0x180
[  541.031183]  [<ffffffff813d3540>] ? __hw_addr_create_ex+0x60/0xc0
[  541.031185]  [<ffffffff813cfe1a>] ? __dev_set_rx_mode+0xaa/0xc0
[  541.031189]  [<ffffffff813d3a81>] ? __dev_mc_add+0x61/0x90
[  541.031198]  [<ffffffffa01dcf9c>] ? igmp6_group_added+0xfc/0x1a0 [ipv6]
[  541.031208]  [<ffffffff8111237b>] ? kmem_cache_alloc+0xcb/0xd0
[  541.031212]  [<ffffffffa01ddcd7>] ? ipv6_dev_mc_inc+0x267/0x300 [ipv6]
[  541.031216]  [<ffffffffa01c2fae>] ? addrconf_join_solict+0x2e/0x40 [ipv6]
[  541.031219]  [<ffffffffa01ba2e9>] ? ipv6_dev_ac_inc+0x159/0x1f0 [ipv6]
[  541.031223]  [<ffffffffa01c0772>] ? addrconf_join_anycast+0x92/0xa0 [ipv6]
[  541.031226]  [<ffffffffa01c311e>] ? __ipv6_ifa_notify+0x11e/0x1e0 [ipv6]
[  541.031229]  [<ffffffffa01c3213>] ? ipv6_ifa_notify+0x33/0x50 [ipv6]
[  541.031233]  [<ffffffffa01c36c8>] ? addrconf_dad_completed+0x28/0x100 [ipv6]
[  541.031241]  [<ffffffff81075c1d>] ? task_cputime+0x2d/0x50
[  541.031244]  [<ffffffffa01c38d6>] ? addrconf_dad_timer+0x136/0x150 [ipv6]
[  541.031247]  [<ffffffffa01c37a0>] ? addrconf_dad_completed+0x100/0x100 [ipv6]
[  541.031255]  [<ffffffff8105313a>] ? call_timer_fn.isra.22+0x2a/0x90
[  541.031258]  [<ffffffffa01c37a0>] ? addrconf_dad_completed+0x100/0x100 [ipv6]

Hunks and backtrace stolen from a patch by Stephen Hemminger.

Reported-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-28 16:54:50 -04:00
Zoltan Kiss
36d5fe6a00 core, nfqueue, openvswitch: Orphan frags in skb_zerocopy and handle errors
skb_zerocopy can copy elements of the frags array between skbs, but it doesn't
orphan them. Also, it doesn't handle errors, so this patch takes care of that
as well, and modify the callers accordingly. skb_tx_error() is also added to
the callers so they will signal the failed delivery towards the creator of the
skb.

Signed-off-by: Zoltan Kiss <zoltan.kiss@citrix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-27 15:29:38 -04:00