commit 2084140594 upstream.
There are a few subtle races, between change_protection_range (used by
mprotect and change_prot_numa) on one side, and NUMA page migration and
compaction on the other side.
The basic race is that there is a time window between when the PTE gets
made non-present (PROT_NONE or NUMA), and the TLB is flushed.
During that time, a CPU may continue writing to the page.
This is fine most of the time, however compaction or the NUMA migration
code may come in, and migrate the page away.
When that happens, the CPU may continue writing, through the cached
translation, to what is no longer the current memory location of the
process.
This only affects x86, which has a somewhat optimistic pte_accessible.
All other architectures appear to be safe, and will either always flush,
or flush whenever there is a valid mapping, even with no permissions
(SPARC).
The basic race looks like this:
CPU A CPU B CPU C
load TLB entry
make entry PTE/PMD_NUMA
fault on entry
read/write old page
start migrating page
change PTE/PMD to new page
read/write old page [*]
flush TLB
reload TLB from new entry
read/write new page
lose data
[*] the old page may belong to a new user at this point!
The obvious fix is to flush remote TLB entries, by making sure that
pte_accessible aware of the fact that PROT_NONE and PROT_NUMA memory may
still be accessible if there is a TLB flush pending for the mm.
This should fix both NUMA migration and compaction.
[mgorman@suse.de: fix build]
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Alex Thorlton <athorlton@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit e0acd0a68e upstream.
This is only theoretical, but after try_to_wake_up(p) was changed
to check p->state under p->pi_lock the code like
__set_current_state(TASK_INTERRUPTIBLE);
schedule();
can miss a signal. This is the special case of wait-for-condition,
it relies on try_to_wake_up/schedule interaction and thus it does
not need mb() between __set_current_state() and if(signal_pending).
However, this __set_current_state() can move into the critical
section protected by rq->lock, now that try_to_wake_up() takes
another lock we need to ensure that it can't be reordered with
"if (signal_pending(current))" check inside that section.
The patch is actually one-liner, it simply adds smp_wmb() before
spin_lock_irq(rq->lock). This is what try_to_wake_up() already
does by the same reason.
We turn this wmb() into the new helper, smp_mb__before_spinlock(),
for better documentation and to allow the architectures to change
the default implementation.
While at it, kill smp_mb__after_lock(), it has no callers.
Perhaps we can also add smp_mb__before/after_spinunlock() for
prepare_to_wait().
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit c3a489cac3 upstream.
The anon_vma lock prevents parallel THP splits and any associated
complexity that arises when handling splits during THP migration. This
patch checks if the lock was successfully acquired and bails from THP
migration if it failed for any reason.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Alex Thorlton <athorlton@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 67f87463d3 upstream.
On x86, PMD entries are similar to _PAGE_PROTNONE protection and are
handled as NUMA hinting faults. The following two page table protection
bits are what defines them
_PAGE_NUMA:set _PAGE_PRESENT:clear
A PMD is considered present if any of the _PAGE_PRESENT, _PAGE_PROTNONE,
_PAGE_PSE or _PAGE_NUMA bits are set. If pmdp_invalidate encounters a
pmd_numa, it clears the present bit leaving _PAGE_NUMA which will be
considered not present by the CPU but present by pmd_present. The
existing caller of pmdp_invalidate should handle it but it's an
inconsistent state for a PMD. This patch keeps the state consistent
when calling pmdp_invalidate.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Alex Thorlton <athorlton@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 13fcca8f25 upstream.
This reverts commit e38c0a1fbc.
Nikita Yushchenko reports:
While trying to make freescale p2020ds and mpc8572ds boards working
with mainline kernel, I faced that commit e38c0a1f (Handle
Both these boards have uli1575 chip.
Corresponding part in device tree is something like
uli1575@0 {
reg = <0x0 0x0 0x0 0x0 0x0>;
#size-cells = <2>;
#address-cells = <3>;
ranges = <0x2000000 0x0 0x80000000
0x2000000 0x0 0x80000000
0x0 0x20000000
0x1000000 0x0 0x0
0x1000000 0x0 0x0
0x0 0x10000>;
isa@1e {
...
I.e. it has #address-cells = <3>
With commit e38c0a1f reverted, devices under uli1575 are registered
correctly, e.g. for rtc
OF: ** translation for device /pcie@ffe09000/pcie@0/uli1575@0/isa@1e/rtc@70 **
OF: bus is isa (na=2, ns=1) on /pcie@ffe09000/pcie@0/uli1575@0/isa@1e
OF: translating address: 00000001 00000070
OF: parent bus is default (na=3, ns=2) on /pcie@ffe09000/pcie@0/uli1575@0
OF: walking ranges...
OF: ISA map, cp=0, s=1000, da=70
OF: parent translation for: 01000000 00000000 00000000
OF: with offset: 70
OF: one level translation: 00000000 00000000 00000070
OF: parent bus is pci (na=3, ns=2) on /pcie@ffe09000/pcie@0
OF: walking ranges...
OF: default map, cp=a0000000, s=20000000, da=70
OF: default map, cp=0, s=10000, da=70
OF: parent translation for: 01000000 00000000 00000000
OF: with offset: 70
OF: one level translation: 01000000 00000000 00000070
OF: parent bus is pci (na=3, ns=2) on /pcie@ffe09000
OF: walking ranges...
OF: PCI map, cp=0, s=10000, da=70
OF: parent translation for: 01000000 00000000 00000000
OF: with offset: 70
OF: one level translation: 01000000 00000000 00000070
OF: parent bus is default (na=2, ns=2) on /
OF: walking ranges...
OF: PCI map, cp=0, s=10000, da=70
OF: parent translation for: 00000000 ffc10000
OF: with offset: 70
OF: one level translation: 00000000 ffc10070
OF: reached root node
With commit e38c0a1f in place, address translation fails:
OF: ** translation for device /pcie@ffe09000/pcie@0/uli1575@0/isa@1e/rtc@70 **
OF: bus is isa (na=2, ns=1) on /pcie@ffe09000/pcie@0/uli1575@0/isa@1e
OF: translating address: 00000001 00000070
OF: parent bus is default (na=3, ns=2) on /pcie@ffe09000/pcie@0/uli1575@0
OF: walking ranges...
OF: ISA map, cp=0, s=1000, da=70
OF: parent translation for: 01000000 00000000 00000000
OF: with offset: 70
OF: one level translation: 00000000 00000000 00000070
OF: parent bus is pci (na=3, ns=2) on /pcie@ffe09000/pcie@0
OF: walking ranges...
OF: default map, cp=a0000000, s=20000000, da=70
OF: default map, cp=0, s=10000, da=70
OF: not found !
Thierry Reding confirmed this commit was not needed after all:
"We ended up merging a different address representation for Tegra PCIe
and I've confirmed that reverting this commit doesn't cause any obvious
regressions. I think all other drivers in drivers/pci/host ended up
copying what we did on Tegra, so I wouldn't expect any other breakage
either."
There doesn't appear to be a simple way to support both behaviours, so
reverting this as nothing should be depending on the new behaviour.
Signed-off-by: Rob Herring <robh@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit bd02cd2549 upstream.
Evan Huus found (by fuzzing in wireshark) that the radiotap
iterator code can access beyond the length of the buffer if
the first bitmap claims an extension but then there's no
data at all. Fix this.
Reported-by: Evan Huus <eapache@gmail.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 85fbd722ad upstream.
Freezable kthreads and workqueues are fundamentally problematic in
that they effectively introduce a big kernel lock widely used in the
kernel and have already been the culprit of several deadlock
scenarios. This is the latest occurrence.
During resume, libata rescans all the ports and revalidates all
pre-existing devices. If it determines that a device has gone
missing, the device is removed from the system which involves
invalidating block device and flushing bdi while holding driver core
layer locks. Unfortunately, this can race with the rest of device
resume. Because freezable kthreads and workqueues are thawed after
device resume is complete and block device removal depends on
freezable workqueues and kthreads (e.g. bdi_wq, jbd2) to make
progress, this can lead to deadlock - block device removal can't
proceed because kthreads are frozen and kthreads can't be thawed
because device resume is blocked behind block device removal.
839a8e8660 ("writeback: replace custom worker pool implementation
with unbound workqueue") made this particular deadlock scenario more
visible but the underlying problem has always been there - the
original forker task and jbd2 are freezable too. In fact, this is
highly likely just one of many possible deadlock scenarios given that
freezer behaves as a big kernel lock and we don't have any debug
mechanism around it.
I believe the right thing to do is getting rid of freezable kthreads
and workqueues. This is something fundamentally broken. For now,
implement a funny workaround in libata - just avoid doing block device
hot[un]plug while the system is frozen. Kernel engineering at its
finest. :(
v2: Add EXPORT_SYMBOL_GPL(pm_freezing) for cases where libata is built
as a module.
v3: Comment updated and polling interval changed to 10ms as suggested
by Rafael.
v4: Add #ifdef CONFIG_FREEZER around the hack as pm_freezing is not
defined when FREEZER is not configured thus breaking build.
Reported by kbuild test robot.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Tomaž Šolc <tomaz.solc@tablix.org>
Reviewed-by: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=62801
Link: http://lkml.kernel.org/r/20131213174932.GA27070@htj.dyndns.org
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Len Brown <len.brown@intel.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 966fbe193f upstream.
Some device require DMADIR to be enabled, but are not detected as such
by atapi_id_dmadir. One such example is "Asus Serillel 2"
SATA-host-to-PATA-device bridge: the bridge itself requires DMADIR,
even if the bridged device does not.
As atapi_dmadir module parameter can cause problems with some devices
(as per Tejun Heo's memory), enabling it globally may not be possible
depending on the hardware.
This patch adds atapi_dmadir in the form of a "force" horkage value,
allowing global, per-bus and per-device control.
Signed-off-by: Vincent Pelletier <plr.vincent@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 87809942d3 upstream.
We've received multiple reports in Fedora via (BZ 907193)
that the Seagate Momentus SpinPoint M8 errors out when enabling AA:
[ 2.555905] ata2.00: failed to enable AA (error_mask=0x1)
[ 2.568482] ata2.00: failed to enable AA (error_mask=0x1)
Add the ATA_HORKAGE_BROKEN_FPDMA_AA for this specific harddisk.
Reported-by: Nicholas <arealityfarbetween@googlemail.com>
Signed-off-by: Michele Baldessari <michele@acksyn.org>
Tested-by: Nicholas <arealityfarbetween@googlemail.com>
Acked-by: Alan Cox <gnomes@lxorguk.ukuu.org.uk>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit f447ef4a56 upstream.
If a user calls 'cpupower set --perf-bias 15', the process will end with
a SIGSEGV in libc because cpupower-set passes a NULL optarg to the atoi
call. This is because the getopt_long structure currently has all of
the options as having an optional_argument when they really have a
required argument. We change the structure to use required_argument to
match the short options and it resolves the issue.
This fixes https://bugzilla.redhat.com/show_bug.cgi?id=1000439
Signed-off-by: Josh Boyer <jwboyer@fedoraproject.org>
Cc: Dominik Brodowski <linux@dominikbrodowski.net>
Cc: Thomas Renninger <trenn@suse.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 286e4f90a7 upstream.
p_end is an 8 byte value embedded in the text section. This means it
is only 4 byte aligned when it should be 8 byte aligned. Fix this
by adding an explicit alignment.
This fixes an issue where POWER7 little endian builds with
CONFIG_RELOCATABLE=y fail to boot.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 90ff5d688e upstream.
In EXCEPTION_PROLOG_COMMON() we check to see if the stack pointer (r1)
is valid when coming from the kernel. If it's not valid, we die but
with a nice oops message.
Currently we allocate a stack frame (subtract INT_FRAME_SIZE) before we
check to see if the stack pointer is negative. Unfortunately, this
won't detect a bad stack where r1 is less than INT_FRAME_SIZE.
This patch fixes the check to compare the modified r1 with
-INT_FRAME_SIZE. With this, bad kernel stack pointers (including NULL
pointers) are correctly detected again.
Kudos to Paulus for finding this.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit e66d2ae7c6 upstream.
Update arch.apic_base before triggering recalculate_apic_map. Otherwise
the recalculation will work against the previous state of the APIC and
will fail to build the correct map when an APIC is hardware-enabled
again.
This fixes a regression of 1e08ec4a13.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 657eb17d87 upstream.
Pick the MAC address of the first virtual interface as the new hardware MAC
address. Set BSSID mask according to this MAC address. This fixes CVE-2013-4579.
Signed-off-by: Mathy Vanhoef <vanhoefm@gmail.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 73f0b56a1f upstream.
This patch adds a driver workaround for a HW issue.
A race condition in the HW results in missing interrupts,
which can be avoided by a read/write with the ISR register.
All chips in the AR9002 series are affected by this bug - AR9003
and above do not have this problem.
Cc: Felix Fietkau <nbd@openwrt.org>
Signed-off-by: Sujith Manoharan <c_manoha@qca.qualcomm.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 4263c86dca upstream.
Certain dm962x revisions contain an bug, where if a USB bulk transfer retry
(E.G. if bulk crc mismatch) happens right after a transfer with odd or
maxpacket length, the internal tx hardware fifo gets out of sync causing
the interface to stop working.
Work around it by adding up to 3 bytes of padding to ensure this situation
cannot trigger.
This workaround also means we never pass multiple-of-maxpacket size skb's
to usbnet, so the length adjustment to handle usbnet's padding of those can
be removed.
Reported-by: Joseph Chang <joseph_chang@davicom.com.tw>
Signed-off-by: Peter Korsgaard <peter@korsgaard.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 375679104a upstream.
The current driver assumes that an skb fragment can only be upto jumbo
size. Presumably this was a fast-path optimization. This assumption is
no longer true as fragments can be upto 32k.
v2: Remove unnecessary parantheses per Eric Dumazet.
Signed-off-by: Nithin Nayak Sujir <nsujir@broadcom.com>
Signed-off-by: Michael Chan <mchan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 56f91aad69 upstream.
If the length of data to be read in readpage() is exactly
PAGE_CACHE_SIZE, the original code does not flush d-cache
for data consistency after finishing reading. This patches fixes
this.
Signed-off-by: Li Wang <liwang@ubuntukylin.com>
Signed-off-by: Sage Weil <sage@inktank.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit a885b3ccc7 upstream.
The GMCH_CTRL register (or MGCC in the spec) is at a different address
on Sandybridge, and the address to which we currently write to is
undefined. These stray writes appear to upset (hard hang) my Ivybridge
machine whilst it is in UEFI mode.
Note that the register is still marked as locked RO on Sandybridge, so
vgaarb is still dysfunctional.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Ville Syrjälä <ville.syrjala@linux.intel.com>
Reviewed-by: Jani Nikula <jani.nikula@intel.com>
Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 6c719faca2 upstream.
The update is horribly racy since it doesn't protect at all against
concurrent closing of the master fd. And it can't really since that
requires us to grab a mutex.
Instead of jumping through hoops and offloading this to a worker
thread just block this bit of code for the modesetting driver.
Note that the race is fairly easy to hit since we call the breadcrumb
function for any interrupt. So the vblank interrupt (which usually
keeps going for a bit) is enough. But even if we'd block this and only
update the breadcrumb for user interrupts from the CS we could hit
this race with kms/gem userspace: If a non-master is waiting somewhere
(and hence has interrupts enabled) and the master closes its fd
(probably due to crashing).
v2: Add a code comment to explain why fixing this for real isn't
really worth it. Also improve the commit message a bit.
v3: Fix the spelling in the comment.
Reported-by: Eugene Shatokhin <eugene.shatokhin@rosalab.ru>
Cc: Eugene Shatokhin <eugene.shatokhin@rosalab.ru>
Acked-by: Chris Wilson <chris@chris-wilson.co.uk>
Tested-by: Eugene Shatokhin <eugene.shatokhin@rosalab.ru>
Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 0274766428 upstream.
Some lower level things get angry if we don't have modeset locks
during intel_modeset_setup_hw_state(). Actually the resume and
lid_notify codepaths alreday hold the locks, but the init codepath
doesn't, so fix that.
Note: This slipped through since we only disable pipes if the
plane/pipe linking doesn't match. Which is only relevant on older
gen3 mobile machines, if the BIOS fails to set up our preferred
linking.
Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Tested-and-reported-by: Paul Bolle <pebolle@tiscali.nl>
[danvet: Add note now that I could confirm my theory with the log
files Paul Bolle provided.]
Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 7787380336 upstream.
net_dma can cause data to be copied to a stale mapping if a
copy-on-write fault occurs during dma. The application sees missing
data.
The following trace is triggered by modifying the kernel to WARN if it
ever triggers copy-on-write on a page that is undergoing dma:
WARNING: CPU: 24 PID: 2529 at lib/dma-debug.c:485 debug_dma_assert_idle+0xd2/0x120()
ioatdma 0000:00:04.0: DMA-API: cpu touching an active dma mapped page [pfn=0x16bcd9]
Modules linked in: iTCO_wdt iTCO_vendor_support ioatdma lpc_ich pcspkr dca
CPU: 24 PID: 2529 Comm: linbug Tainted: G W 3.13.0-rc1+ #353
00000000000001e5 ffff88016f45f688 ffffffff81751041 ffff88017ab0ef70
ffff88016f45f6d8 ffff88016f45f6c8 ffffffff8104ed9c ffffffff810f3646
ffff8801768f4840 0000000000000282 ffff88016f6cca10 00007fa2bb699349
Call Trace:
[<ffffffff81751041>] dump_stack+0x46/0x58
[<ffffffff8104ed9c>] warn_slowpath_common+0x8c/0xc0
[<ffffffff810f3646>] ? ftrace_pid_func+0x26/0x30
[<ffffffff8104ee86>] warn_slowpath_fmt+0x46/0x50
[<ffffffff8139c062>] debug_dma_assert_idle+0xd2/0x120
[<ffffffff81154a40>] do_wp_page+0xd0/0x790
[<ffffffff811582ac>] handle_mm_fault+0x51c/0xde0
[<ffffffff813830b9>] ? copy_user_enhanced_fast_string+0x9/0x20
[<ffffffff8175fc2c>] __do_page_fault+0x19c/0x530
[<ffffffff8175c196>] ? _raw_spin_lock_bh+0x16/0x40
[<ffffffff810f3539>] ? trace_clock_local+0x9/0x10
[<ffffffff810fa1f4>] ? rb_reserve_next_event+0x64/0x310
[<ffffffffa0014c00>] ? ioat2_dma_prep_memcpy_lock+0x60/0x130 [ioatdma]
[<ffffffff8175ffce>] do_page_fault+0xe/0x10
[<ffffffff8175c862>] page_fault+0x22/0x30
[<ffffffff81643991>] ? __kfree_skb+0x51/0xd0
[<ffffffff813830b9>] ? copy_user_enhanced_fast_string+0x9/0x20
[<ffffffff81388ea2>] ? memcpy_toiovec+0x52/0xa0
[<ffffffff8164770f>] skb_copy_datagram_iovec+0x5f/0x2a0
[<ffffffff8169d0f4>] tcp_rcv_established+0x674/0x7f0
[<ffffffff816a68c5>] tcp_v4_do_rcv+0x2e5/0x4a0
[..]
---[ end trace e30e3b01191b7617 ]---
Mapped at:
[<ffffffff8139c169>] debug_dma_map_page+0xb9/0x160
[<ffffffff8142bf47>] dma_async_memcpy_pg_to_pg+0x127/0x210
[<ffffffff8142cce9>] dma_memcpy_pg_to_iovec+0x119/0x1f0
[<ffffffff81669d3c>] dma_skb_copy_datagram_iovec+0x11c/0x2b0
[<ffffffff8169d1ca>] tcp_rcv_established+0x74a/0x7f0:
...the problem is that the receive path falls back to cpu-copy in
several locations and this trace is just one of the areas. A few
options were considered to fix this:
1/ sync all dma whenever a cpu copy branch is taken
2/ modify the page fault handler to hold off while dma is in-flight
Option 1 adds yet more cpu overhead to an "offload" that struggles to compete
with cpu-copy. Option 2 adds checks for behavior that is already documented as
broken when using get_user_pages(). At a minimum a debug mode is warranted to
catch and flag these violations of the dma-api vs get_user_pages().
Thanks to David for his reproducer.
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Vinod Koul <vinod.koul@intel.com>
Cc: Alexander Duyck <alexander.h.duyck@intel.com>
Reported-by: David Whipple <whipple@securedatainnovations.ch>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit ce027ed98f upstream.
Commit 54b2b50c20 "[SCSI] Disable WRITE SAME for RAID and virtual
host adapter drivers" disabled WRITE SAME support for all SBP-2 attached
targets. But as described in the changelog of commit b0ea5f19d3
"firewire: sbp2: allow WRITE SAME and REPORT SUPPORTED OPERATION CODES",
it is not required to blacklist WRITE SAME.
Bring the feature back by reverting the sbp2.c hunk of commit 54b2b50c20.
Signed-off-by: Stefan Richter <stefanr@s5r6.in-berlin.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 757dfcaa41 upstream.
This patch touches the RT group scheduling case.
Functions inc_rt_prio_smp() and dec_rt_prio_smp() change (global) rq's
priority, while rt_rq passed to them may be not the top-level rt_rq.
This is wrong, because changing of priority on a child level does not
guarantee that the priority is the highest all over the rq. So, this
leak makes RT balancing unusable.
The short example: the task having the highest priority among all rq's
RT tasks (no one other task has the same priority) are waking on a
throttle rt_rq. The rq's cpupri is set to the task's priority
equivalent, but real rq->rt.highest_prio.curr is less.
The patch below fixes the problem.
Signed-off-by: Kirill Tkhai <tkhai@yandex.ru>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
CC: Steven Rostedt <rostedt@goodmis.org>
Link: http://lkml.kernel.org/r/49231385567953@web4m.yandex.ru
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 8f9ff18920 upstream.
When using FITRIM ioctl on a file system without journal it will
only trim the block group once, no matter how many times you invoke
FITRIM ioctl and how many block you release from the block group.
It is because we only clear EXT4_GROUP_INFO_WAS_TRIMMED_BIT in journal
callback. Fix this by clearing the bit in no journal mode as well.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reported-by: Jorge Fábregas <jorge.fabregas@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit f5a44db5d2 upstream.
The missing casts can cause the high 64-bits of the physical blocks to
be lost. Set up new macros which allows us to make sure the right
thing happen, even if at some point we end up supporting larger
logical block numbers.
Thanks to the Emese Revfy and the PaX security team for reporting this
issue.
Reported-by: PaX Team <pageexec@freemail.hu>
Reported-by: Emese Revfy <re.emese@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 34cf865d54 upstream.
Akira-san has been reporting rare deadlocks of his machine when running
xfstests test 269 on ext4 filesystem. The problem turned out to be in
ext4_da_reserve_metadata() and ext4_da_reserve_space() which called
ext4_should_retry_alloc() while holding i_data_sem. Since
ext4_should_retry_alloc() can force a transaction commit, this is a
lock ordering violation and leads to deadlocks.
Fix the problem by just removing the retry loops. These functions should
just report ENOSPC to the caller (e.g. ext4_da_write_begin()) and that
function must take care of retrying after dropping all necessary locks.
Reported-and-tested-by: Akira Fujita <a-fujita@rs.jp.nec.com>
Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 30fac0f75d upstream.
When the filesystem doesn't support extents (like in ext2/3
compatibility modes), there is no need to reserve any clusters. Space
estimates for writing are exact, hole punching doesn't need new
metadata, and there are no unwritten extents to convert.
This fixes a problem when filesystem still having some free space when
accessed with a native ext2/3 driver suddently reports ENOSPC when
accessed with ext4 driver.
Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Tested-by: Geert Uytterhoeven <geert@linux-m68k.org>
Reviewed-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 5946d08937 upstream.
A corrupted ext4 may have out of order leaf extents, i.e.
extent: lblk 0--1023, len 1024, pblk 9217, flags: LEAF UNINIT
extent: lblk 1000--2047, len 1024, pblk 10241, flags: LEAF UNINIT
^^^^ overlap with previous extent
Reading such extent could hit BUG_ON() in ext4_es_cache_extent().
BUG_ON(end < lblk);
The problem is that __read_extent_tree_block() tries to cache holes as
well but assumes 'lblk' is greater than 'prev' and passes underflowed
length to ext4_es_cache_extent(). Fix it by checking for overlapping
extents in ext4_valid_extent_entries().
I hit this when fuzz testing ext4, and am able to reproduce it by
modifying the on-disk extent by hand.
Also add the check for (ee_block + len - 1) in ext4_valid_extent() to
make sure the value is not overflow.
Ran xfstests on patched ext4 and no regression.
Cc: Lukáš Czerner <lczerner@redhat.com>
Signed-off-by: Eryu Guan <guaneryu@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit ae1495b12d upstream.
While it's true that errors can only happen if there is a bug in
jbd2_journal_dirty_metadata(), if a bug does happen, we need to halt
the kernel or remount the file system read-only in order to avoid
further data loss. The ext4_journal_abort_handle() function doesn't
do any of this, and while it's likely that this call (since it doesn't
adjust refcounts) will likely result in the file system eventually
deadlocking since the current transaction will never be able to close,
it's much cleaner to call let ext4's error handling system deal with
this situation.
There's a separate bug here which is that if certain jbd2 errors
errors occur and file system is mounted errors=continue, the file
system will probably eventually end grind to a halt as described
above. But things have been this way in a long time, and usually when
we have these sorts of errors it's pretty much a disaster --- and
that's why the jbd2 layer aggressively retries memory allocations,
which is the most likely cause of these jbd2 errors.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 40e2d7f9b5 upstream.
Linux 3.10 changed the timing of how thread_info->flags is touched:
x86: Use generic idle loop
(7d1a941731)
This caused Intel NHM-EX and WSM-EX servers to experience a large number
of immediate MONITOR/MWAIT break wakeups, which caused cpuidle to demote
from deep C-states to shallow C-states, which caused these platforms
to experience a significant increase in idle power.
Note that this issue was already present before the commit above,
however, it wasn't seen often enough to be noticed in power measurements.
Here we extend an errata workaround from the Core2 EX "Dunnington"
to extend to NHM-EX and WSM-EX, to prevent these immediate
returns from MWAIT, reducing idle power on these platforms.
While only acpi_idle ran on Dunnington, intel_idle
may also run on these two newer systems.
As of today, there are no other models that are known
to need this tweak.
Link: http://lkml.kernel.org/r/CAJvTdK=%2BaNN66mYpCGgbHGCHhYQAKx-vB0kJSWjVpsNb_hOAtQ@mail.gmail.com
Signed-off-by: Len Brown <len.brown@intel.com>
Link: http://lkml.kernel.org/r/baff264285f6e585df757d58b17788feabc68918.1387403066.git.len.brown@intel.com
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 6d4c883047 upstream.
Commit 7d7e1eb (ARM: OMAP2+: Prepare for irqs.h removal) and commit
ec2c082 (ARM: OMAP2+: Remove hardcoded IRQs and enable SPARSE_IRQ)
updated the way interrupts for OMAP2/3 devices are defined in the
HWMOD data structures to being an index plus a fixed offset (defined
by OMAP_INTC_START).
Couple of irqs in the OMAP2/3 hwmod data were misconfigured completely
as they were missing this OMAP_INTC_START relative offset. Add this
offset back to fix the incorrect irq data for the following modules:
OMAP2 - GPMC, RNG
OMAP3 - GPMC, ISP MMU & IVA MMU
Signed-off-by: Suman Anna <s-anna@ti.com>
Fixes: 7d7e1eba7e ("ARM: OMAP2+: Prepare for irqs.h removal")
Fixes: ec2c0825ca ("ARM: OMAP2+: Remove hardcoded IRQs and enable SPARSE_IRQ")
Cc: Tony Lindgren <tony@atomide.com>
Signed-off-by: Paul Walmsley <paul@pwsan.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 4ecf7ccb19 upstream.
An exclusive store instruction may fail for reasons other than lock
contention (e.g. a cache eviction during the critical section) so, in
line with other architectures using similar exclusive instructions
(alpha, mips, powerpc), retry the trylock operation if the lock appears
to be free but the strex reported failure.
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Reported-by: Tony Thompson <anthony.thompson@arm.com>
Acked-by: Will Deacon <will.deacon@arm.com>
Cc: Mark Brown <broonie@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit cdc27c2784 upstream.
Commit 8f34a1da35 ("arm64: ptrace: use HW_BREAKPOINT_EMPTY type for
disabled breakpoints") fixed an issue with GDB trying to zero breakpoint
control registers. The problem there is that the arch hw_breakpoint code
will attempt to create a (disabled), execute breakpoint of length 0.
This will fail validation and report unexpected failure to GDB. To avoid
this, we treated disabled breakpoints as HW_BREAKPOINT_EMPTY, but that
seems to have broken with recent kernels, causing watchpoints to be
treated as TYPE_INST in the core code and returning ENOSPC for any
further breakpoints.
This patch fixes the problem by prioritising the `enable' field of the
breakpoint: if it is cleared, we simply update the perf_event_attr to
indicate that the thing is disabled and don't bother changing either the
type or the length. This reinforces the behaviour that the breakpoint
control register is essentially read-only apart from the enable bit
when disabling a breakpoint.
Reported-by: Aaron Liu <liucy214@gmail.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit c4602c1c81 upstream.
Ftrace currently initializes only the online CPUs. This implementation has
two problems:
- If we online a CPU after we enable the function profile, and then run the
test, we will lose the trace information on that CPU.
Steps to reproduce:
# echo 0 > /sys/devices/system/cpu/cpu1/online
# cd <debugfs>/tracing/
# echo <some function name> >> set_ftrace_filter
# echo 1 > function_profile_enabled
# echo 1 > /sys/devices/system/cpu/cpu1/online
# run test
- If we offline a CPU before we enable the function profile, we will not clear
the trace information when we enable the function profile. It will trouble
the users.
Steps to reproduce:
# cd <debugfs>/tracing/
# echo <some function name> >> set_ftrace_filter
# echo 1 > function_profile_enabled
# run test
# cat trace_stat/function*
# echo 0 > /sys/devices/system/cpu/cpu1/online
# echo 0 > function_profile_enabled
# echo 1 > function_profile_enabled
# cat trace_stat/function*
# run test
# cat trace_stat/function*
So it is better that we initialize the ftrace profiler for each possible cpu
every time we enable the function profile instead of just the online ones.
Link: http://lkml.kernel.org/r/1387178401-10619-1-git-send-email-miaox@cn.fujitsu.com
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>