commit 06bdadd763 upstream.
audit_syscall_exit() saves a result of regs_return_value() in intermediate
"int" variable and passes it to __audit_syscall_exit(), which expects its
second argument as a "long" value. This will result in truncating the
value returned by a system call and making a wrong audit record.
I don't know why gcc compiler doesn't complain about this, but anyway it
causes a problem at runtime on arm64 (and probably most 64-bit archs).
Signed-off-by: AKASHI Takahiro <takahiro.akashi@linaro.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Eric Paris <eparis@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Eric Paris <eparis@redhat.com>
Signed-off-by: Kamal Mostafa <kamal@canonical.com>
commit 51c71a3bba upstream.
The user has the option of disabling the platform driver:
00:02.0 Unassigned class [ff80]: XenSource, Inc. Xen Platform Device (rev 01)
which is used to unplug the emulated drivers (IDE, Realtek 8169, etc)
and allow the PV drivers to take over. If the user wishes
to disable that they can set:
xen_platform_pci=0
(in the guest config file)
or
xen_emul_unplug=never
(on the Linux command line)
except it does not work properly. The PV drivers still try to
load and since the Xen platform driver is not run - and it
has not initialized the grant tables, most of the PV drivers
stumble upon:
input: Xen Virtual Keyboard as /devices/virtual/input/input5
input: Xen Virtual Pointer as /devices/virtual/input/input6M
------------[ cut here ]------------
kernel BUG at /home/konrad/ssd/konrad/linux/drivers/xen/grant-table.c:1206!
invalid opcode: 0000 [#1] SMP
Modules linked in: xen_kbdfront(+) xenfs xen_privcmd
CPU: 6 PID: 1389 Comm: modprobe Not tainted 3.13.0-rc1upstream-00021-ga6c892b-dirty #1
Hardware name: Xen HVM domU, BIOS 4.4-unstable 11/26/2013
RIP: 0010:[<ffffffff813ddc40>] [<ffffffff813ddc40>] get_free_entries+0x2e0/0x300
Call Trace:
[<ffffffff8150d9a3>] ? evdev_connect+0x1e3/0x240
[<ffffffff813ddd0e>] gnttab_grant_foreign_access+0x2e/0x70
[<ffffffffa0010081>] xenkbd_connect_backend+0x41/0x290 [xen_kbdfront]
[<ffffffffa0010a12>] xenkbd_probe+0x2f2/0x324 [xen_kbdfront]
[<ffffffff813e5757>] xenbus_dev_probe+0x77/0x130
[<ffffffff813e7217>] xenbus_frontend_dev_probe+0x47/0x50
[<ffffffff8145e9a9>] driver_probe_device+0x89/0x230
[<ffffffff8145ebeb>] __driver_attach+0x9b/0xa0
[<ffffffff8145eb50>] ? driver_probe_device+0x230/0x230
[<ffffffff8145eb50>] ? driver_probe_device+0x230/0x230
[<ffffffff8145cf1c>] bus_for_each_dev+0x8c/0xb0
[<ffffffff8145e7d9>] driver_attach+0x19/0x20
[<ffffffff8145e260>] bus_add_driver+0x1a0/0x220
[<ffffffff8145f1ff>] driver_register+0x5f/0xf0
[<ffffffff813e55c5>] xenbus_register_driver_common+0x15/0x20
[<ffffffff813e76b3>] xenbus_register_frontend+0x23/0x40
[<ffffffffa0015000>] ? 0xffffffffa0014fff
[<ffffffffa001502b>] xenkbd_init+0x2b/0x1000 [xen_kbdfront]
[<ffffffff81002049>] do_one_initcall+0x49/0x170
.. snip..
which is hardly nice. This patch fixes this by having each
PV driver check for:
- if running in PV, then it is fine to execute (as that is their
native environment).
- if running in HVM, check if user wanted 'xen_emul_unplug=never',
in which case bail out and don't load any PV drivers.
- if running in HVM, and if PCI device 5853:0001 (xen_platform_pci)
does not exist, then bail out and not load PV drivers.
- (v2) if running in HVM, and if the user wanted 'xen_emul_unplug=ide-disks',
then bail out for all PV devices _except_ the block one.
Ditto for the network one ('nics').
- (v2) if running in HVM, and if the user wanted 'xen_emul_unplug=unnecessary'
then load block PV driver, and also setup the legacy IDE paths.
In (v3) make it actually load PV drivers.
Reported-by: Sander Eikelenboom <linux@eikelenboom.it
Reported-by: Anthony PERARD <anthony.perard@citrix.com>
Reported-and-Tested-by: Fabio Fantoni <fabio.fantoni@m2r.biz>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
[v2: Add extra logic to handle the myrid ways 'xen_emul_unplug'
can be used per Ian and Stefano suggestion]
[v3: Make the unnecessary case work properly]
[v4: s/disks/ide-disks/ spotted by Fabio]
Reviewed-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Acked-by: Bjorn Helgaas <bhelgaas@google.com> [for PCI parts]
[ kamal: backport to 3.8 (no drivers/char/tpm/xen-tpmfront.c) ]
Signed-off-by: Kamal Mostafa <kamal@canonical.com>
[ Upstream commit 2205369a31 ]
When the vlan code detects that the real device can do TX VLAN offloads
in hardware, it tries to arrange for the real device's header_ops to
be invoked directly.
But it does so illegally, by simply hooking the real device's
header_ops up to the VLAN device.
This doesn't work because we will end up invoking a set of header_ops
routines which expect a device type which matches the real device, but
will see a VLAN device instead.
Fix this by providing a pass-thru set of header_ops which will arrange
to pass the proper real device instead.
To facilitate this add a dev_rebuild_header(). There are
implementations which provide a ->cache and ->create but not a
->rebuild (f.e. PLIP). So we need a helper function just like
dev_hard_header() to avoid crashes.
Use this helper in the one existing place where the
header_ops->rebuild was being invoked, the neighbour code.
With lots of help from Florian Westphal.
Signed-off-by: David S. Miller <davem@davemloft.net>
[ kamal: backport to 3.8 (context) ]
Signed-off-by: Kamal Mostafa <kamal@canonical.com>
[ Upstream commit 12663bfc97 ]
unix_dgram_recvmsg() will hold the readlock of the socket until recv
is complete.
In the same time, we may try to setsockopt(SO_PEEK_OFF) which will hang until
unix_dgram_recvmsg() will complete (which can take a while) without allowing
us to break out of it, triggering a hung task spew.
Instead, allow set_peek_off to fail, this way userspace will not hang.
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Kamal Mostafa <kamal@canonical.com>
commit 439a1cfffe upstream.
This will allow userspace to correctly program the PA_SC_RASTER_CONFIG
register, so it can be considered a fix.
Signed-off-by: Marek Olšák <marek.olsak@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
[ kamal: backport to 3.8 (SI only, no CIK in 3.8; context) ]
Signed-off-by: Kamal Mostafa <kamal@canonical.com>
commit 95cadace8f upstream.
This patch allows FILEIO to update hw_max_sectors based on the current
max_bytes_per_io. This is required because vfs_[writev,readv]() can accept
a maximum of 2048 iovecs per call, so the enforced hw_max_sectors really
needs to be calculated based on block_size.
This addresses a >= v3.5 bug where block_size=512 was rejecting > 1M
sized I/O requests, because FD_MAX_SECTORS was hardcoded to 2048 for
the block_size=4096 case.
(v2: Use max_bytes_per_io instead of ->update_hw_max_sectors)
Reported-by: Henrik Goldman <hg@x-formation.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
[ kamal: backport to 3.8 (context) ]
Signed-off-by: Kamal Mostafa <kamal@canonical.com>
commit 2084140594 upstream.
There are a few subtle races, between change_protection_range (used by
mprotect and change_prot_numa) on one side, and NUMA page migration and
compaction on the other side.
The basic race is that there is a time window between when the PTE gets
made non-present (PROT_NONE or NUMA), and the TLB is flushed.
During that time, a CPU may continue writing to the page.
This is fine most of the time, however compaction or the NUMA migration
code may come in, and migrate the page away.
When that happens, the CPU may continue writing, through the cached
translation, to what is no longer the current memory location of the
process.
This only affects x86, which has a somewhat optimistic pte_accessible.
All other architectures appear to be safe, and will either always flush,
or flush whenever there is a valid mapping, even with no permissions
(SPARC).
The basic race looks like this:
CPU A CPU B CPU C
load TLB entry
make entry PTE/PMD_NUMA
fault on entry
read/write old page
start migrating page
change PTE/PMD to new page
read/write old page [*]
flush TLB
reload TLB from new entry
read/write new page
lose data
[*] the old page may belong to a new user at this point!
The obvious fix is to flush remote TLB entries, by making sure that
pte_accessible aware of the fact that PROT_NONE and PROT_NUMA memory may
still be accessible if there is a TLB flush pending for the mm.
This should fix both NUMA migration and compaction.
[mgorman@suse.de: fix build]
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Alex Thorlton <athorlton@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Kamal Mostafa <kamal@canonical.com>
commit de466bd628 upstream.
do_huge_pmd_numa_page() handles the case where there is parallel THP
migration. However, by the time it is checked the NUMA hinting
information has already been disrupted. This patch adds an earlier
check with some helpers.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Alex Thorlton <athorlton@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Kamal Mostafa <kamal@canonical.com>
commit 6733cf572a upstream.
snd_pcm_uframes_t is defined as unsigned long so it would take
different sizes depending on 32 or 64bit architectures. As we don't
want this ABI incompatibility, and there is no real 64bit user yet,
let's make it the fixed size with __u32.
Also bump the protocol version number to 0.1.2.
Acked-by: Vinod Koul <vinod.koul@intel.com>
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Signed-off-by: Kamal Mostafa <kamal@canonical.com>
commit 932e9dec38 upstream.
When running a 32bit kernel the hda_intel driver is still reporting
a 64bit dma_mask if the HW supports it.
From sound/pci/hda/hda_intel.c:
/* allow 64bit DMA address if supported by H/W */
if ((gcap & ICH6_GCAP_64OK) && !pci_set_dma_mask(pci, DMA_BIT_MASK(64)))
pci_set_consistent_dma_mask(pci, DMA_BIT_MASK(64));
else {
pci_set_dma_mask(pci, DMA_BIT_MASK(32));
pci_set_consistent_dma_mask(pci, DMA_BIT_MASK(32));
}
which means when there is a call to dma_alloc_coherent from
snd_malloc_dev_pages a machine address bigger than 32bit can be returned.
This can be true in particular if running the 32bit kernel as a pv dom0
under the Xen Hypervisor or PAE on bare metal.
The problem is that when calling setup_bdle to program the BLE the
dma_addr_t returned from the dma_alloc_coherent is wrongly truncated
from snd_sgbuf_get_addr if running a 32bit kernel:
static inline dma_addr_t snd_sgbuf_get_addr(struct snd_dma_buffer *dmab,
size_t offset)
{
struct snd_sg_buf *sgbuf = dmab->private_data;
dma_addr_t addr = sgbuf->table[offset >> PAGE_SHIFT].addr;
addr &= PAGE_MASK;
return addr + offset % PAGE_SIZE;
}
where PAGE_MASK in a 32bit kernel is zeroing the upper 32bit af addr.
Without this patch the HW will fetch the 32bit truncated address,
which is not the one obtained from dma_alloc_coherent and will result
to a non working audio but can corrupt host memory at a random location.
The current patch apply to v3.13-rc3-74-g6c843f5
Signed-off-by: Stefano Panella <stefano.panella@citrix.com>
Reviewed-by: Frediano Ziglio <frediano.ziglio@citrix.com>
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Signed-off-by: Kamal Mostafa <kamal@canonical.com>
commit 4fc9bbf98f upstream.
Add a flag to tell the PCI subsystem that kernel is shutting down in
preparation to kexec a kernel. Add code in PCI subsystem to use this flag
to clear Bus Master bit on PCI devices only in case of kexec reboot.
This fixes a power-off problem on Acer Aspire V5-573G and likely other
machines and avoids any other issues caused by clearing Bus Master bit on
PCI devices in normal shutdown path. The problem was introduced by
b566a22c23 ("PCI: disable Bus Master on PCI device shutdown").
This patch is based on discussion at
http://marc.info/?l=linux-pci&m=138425645204355&w=2
Link: https://bugzilla.kernel.org/show_bug.cgi?id=63861
Reported-by: Chang Liu <cl91tp@gmail.com>
Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Acked-by: Konstantin Khlebnikov <koct9i@gmail.com>
[ kamal: backport to 3.8 (context) ]
Signed-off-by: Kamal Mostafa <kamal@canonical.com>
commit 389a539058 upstream.
Now that scatterwalk_sg_chain sets the chain pointer bit the sg_page
call in scatterwalk_sg_next hits a BUG_ON when CONFIG_DEBUG_SG is
enabled. Use sg_chain_ptr instead of sg_page on a chain entry.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Kamal Mostafa <kamal@canonical.com>
commit 54b2b50c20 upstream.
Some host adapters do not pass commands through to the target disk
directly. Instead they provide an emulated target which may or may not
accurately report its capabilities. In some cases the physical device
characteristics are reported even when the host adapter is processing
commands on the device's behalf. This can lead to adapter firmware hangs
or excessive I/O errors.
This patch disables WRITE SAME for devices connected to host adapters
that provide an emulated target. Driver writers can disable WRITE SAME
by setting the no_write_same flag in the host adapter template.
[jejb: fix up rejections due to eh_deadline patch]
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: James Bottomley <JBottomley@Parallels.com>
[ kamal: backport to 3.8 (context: eh_deadline) ]
Signed-off-by: Kamal Mostafa <kamal@canonical.com>
commit 41da8b5adb upstream.
The scatterwalk_crypto_chain function invokes the scatterwalk_sg_chain
function to chain two scatterlists, but the chain pointer indication
bit is not set. When the resulting scatterlist is used, for example,
by sg_nents to count the number of scatterlist entries, a segfault occurs
because sg_nents does not follow the chain pointer to the chained scatterlist.
Update scatterwalk_sg_chain to set the chain pointer indication bit as is
done by the sg_chain function.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Kamal Mostafa <kamal@canonical.com>
[ Upstream commit 85fbaa7503 ]
Commit bceaa90240 ("inet: prevent leakage
of uninitialized memory to user in recv syscalls") conditionally updated
addr_len if the msg_name is written to. The recv_error and rxpmtu
functions relied on the recvmsg functions to set up addr_len before.
As this does not happen any more we have to pass addr_len to those
functions as well and set it to the size of the corresponding sockaddr
length.
This broke traceroute and such.
Fixes: bceaa90240 ("inet: prevent leakage of uninitialized memory to user in recv syscalls")
Reported-by: Brad Spengler <spender@grsecurity.net>
Reported-by: Tom Labanowski
Cc: mpb <mpb.mail@gmail.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
[ kamal: backport to 3.8 ]
Signed-off-by: Kamal Mostafa <kamal@canonical.com>
[ Upstream commit f3d3342602 ]
This patch now always passes msg->msg_namelen as 0. recvmsg handlers must
set msg_namelen to the proper size <= sizeof(struct sockaddr_storage)
to return msg_name to the user.
This prevents numerous uninitialized memory leaks we had in the
recvmsg handlers and makes it harder for new code to accidentally leak
uninitialized memory.
Optimize for the case recvfrom is called with NULL as address. We don't
need to copy the address at all, so set it to NULL before invoking the
recvmsg handler. We can do so, because all the recvmsg handlers must
cope with the case a plain read() is called on them. read() also sets
msg_name to NULL.
Also document these changes in include/linux/net.h as suggested by David
Miller.
Changes since RFC:
Set msg->msg_name = NULL if user specified a NULL in msg_name but had a
non-null msg_namelen in verify_iovec/verify_compat_iovec. This doesn't
affect sendto as it would bail out earlier while trying to copy-in the
address. It also more naturally reflects the logic by the callers of
verify_iovec.
With this change in place I could remove "
if (!uaddr || msg_sys->msg_namelen == 0)
msg->msg_name = NULL
".
This change does not alter the user visible error logic as we ignore
msg_namelen as long as msg_name is NULL.
Also remove two unnecessary curly brackets in ___sys_recvmsg and change
comments to netdev style.
Cc: David Miller <davem@davemloft.net>
Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Kamal Mostafa <kamal@canonical.com>
[ Upstream commit 51c37a70aa ]
For properly initialising the Tausworthe generator [1], we have
a strict seeding requirement, that is, s1 > 1, s2 > 7, s3 > 15.
Commit 697f8d0348 ("random32: seeding improvement") introduced
a __seed() function that imposes boundary checks proposed by the
errata paper [2] to properly ensure above conditions.
However, we're off by one, as the function is implemented as:
"return (x < m) ? x + m : x;", and called with __seed(X, 1),
__seed(X, 7), __seed(X, 15). Thus, an unwanted seed of 1, 7, 15
would be possible, whereas the lower boundary should actually
be of at least 2, 8, 16, just as GSL does. Fix this, as otherwise
an initialization with an unwanted seed could have the effect
that Tausworthe's PRNG properties cannot not be ensured.
Note that this PRNG is *not* used for cryptography in the kernel.
[1] http://www.iro.umontreal.ca/~lecuyer/myftp/papers/tausme.ps
[2] http://www.iro.umontreal.ca/~lecuyer/myftp/papers/tausme2.ps
Joint work with Hannes Frederic Sowa.
Fixes: 697f8d0348 ("random32: seeding improvement")
Cc: Stephen Hemminger <stephen@networkplumber.org>
Cc: Florian Weimer <fweimer@redhat.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Kamal Mostafa <kamal@canonical.com>
commit 4e9b45a192 upstream.
On 64 bit systems the test for negative message sizes is bogus as the
size, which may be positive when evaluated as a long, will get truncated
to an int when passed to load_msg(). So a long might very well contain a
positive value but when truncated to an int it would become negative.
That in combination with a small negative value of msg_ctlmax (which will
be promoted to an unsigned type for the comparison against msgsz, making
it a big positive value and therefore make it pass the check) will lead to
two problems: 1/ The kmalloc() call in alloc_msg() will allocate a too
small buffer as the addition of alen is effectively a subtraction. 2/ The
copy_from_user() call in load_msg() will first overflow the buffer with
userland data and then, when the userland access generates an access
violation, the fixup handler copy_user_handle_tail() will try to fill the
remainder with zeros -- roughly 4GB. That almost instantly results in a
system crash or reset.
,-[ Reproducer (needs to be run as root) ]--
| #include <sys/stat.h>
| #include <sys/msg.h>
| #include <unistd.h>
| #include <fcntl.h>
|
| int main(void) {
| long msg = 1;
| int fd;
|
| fd = open("/proc/sys/kernel/msgmax", O_WRONLY);
| write(fd, "-1", 2);
| close(fd);
|
| msgsnd(0, &msg, 0xfffffff0, IPC_NOWAIT);
|
| return 0;
| }
'---
Fix the issue by preventing msgsz from getting truncated by consistently
using size_t for the message length. This way the size checks in
do_msgsnd() could still be passed with a negative value for msg_ctlmax but
we would fail on the buffer allocation in that case and error out.
Also change the type of m_ts from int to size_t to avoid similar nastiness
in other code paths -- it is used in similar constructs, i.e. signed vs.
unsigned checks. It should never become negative under normal
circumstances, though.
Setting msg_ctlmax to a negative value is an odd configuration and should
be prevented. As that might break existing userland, it will be handled
in a separate commit so it could easily be reverted and reworked without
reintroducing the above described bug.
Hardening mechanisms for user copy operations would have catched that bug
early -- e.g. checking slab object sizes on user copy operations as the
usercopy feature of the PaX patch does. Or, for that matter, detect the
long vs. int sign change due to truncation, as the size overflow plugin
of the very same patch does.
[akpm@linux-foundation.org: fix i386 min() warnings]
Signed-off-by: Mathias Krause <minipli@googlemail.com>
Cc: Pax Team <pageexec@freemail.hu>
Cc: Davidlohr Bueso <davidlohr@hp.com>
Cc: Brad Spengler <spender@grsecurity.net>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Kamal Mostafa <kamal@canonical.com>
commit d049f74f2d upstream.
The get_dumpable() return value is not boolean. Most users of the
function actually want to be testing for non-SUID_DUMP_USER(1) rather than
SUID_DUMP_DISABLE(0). The SUID_DUMP_ROOT(2) is also considered a
protected state. Almost all places did this correctly, excepting the two
places fixed in this patch.
Wrong logic:
if (dumpable == SUID_DUMP_DISABLE) { /* be protective */ }
or
if (dumpable == 0) { /* be protective */ }
or
if (!dumpable) { /* be protective */ }
Correct logic:
if (dumpable != SUID_DUMP_USER) { /* be protective */ }
or
if (dumpable != 1) { /* be protective */ }
Without this patch, if the system had set the sysctl fs/suid_dumpable=2, a
user was able to ptrace attach to processes that had dropped privileges to
that user. (This may have been partially mitigated if Yama was enabled.)
The macros have been moved into the file that declares get/set_dumpable(),
which means things like the ia64 code can see them too.
CVE-2013-2929
Reported-by: Vasily Kulikov <segoon@openwall.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Kamal Mostafa <kamal@canonical.com>
commit 72403b4a0f upstream.
Commit 0255d49184 ("mm: Account for a THP NUMA hinting update as one
PTE update") was added to account for the number of PTE updates when
marking pages prot_numa. task_numa_work was using the old return value
to track how much address space had been updated. Altering the return
value causes the scanner to do more work than it is configured or
documented to in a single unit of work.
This patch reverts that commit and accounts for the number of THP
updates separately in vmstat. It is up to the administrator to
interpret the pair of values correctly. This is a straight-forward
operation and likely to only be of interest when actively debugging NUMA
balancing problems.
The impact of this patch is that the NUMA PTE scanner will scan slower
when THP is enabled and workloads may converge slower as a result. On
the flip size system CPU usage should be lower than recent tests
reported. This is an illustrative example of a short single JVM specjbb
test
specjbb
3.12.0 3.12.0
vanilla acctupdates
TPut 1 26143.00 ( 0.00%) 25747.00 ( -1.51%)
TPut 7 185257.00 ( 0.00%) 183202.00 ( -1.11%)
TPut 13 329760.00 ( 0.00%) 346577.00 ( 5.10%)
TPut 19 442502.00 ( 0.00%) 460146.00 ( 3.99%)
TPut 25 540634.00 ( 0.00%) 549053.00 ( 1.56%)
TPut 31 512098.00 ( 0.00%) 519611.00 ( 1.47%)
TPut 37 461276.00 ( 0.00%) 474973.00 ( 2.97%)
TPut 43 403089.00 ( 0.00%) 414172.00 ( 2.75%)
3.12.0 3.12.0
vanillaacctupdates
User 5169.64 5184.14
System 100.45 80.02
Elapsed 252.75 251.85
Performance is similar but note the reduction in system CPU time. While
this showed a performance gain, it will not be universal but at least
it'll be behaving as documented. The vmstats are obviously different but
here is an obvious interpretation of them from mmtests.
3.12.0 3.12.0
vanillaacctupdates
NUMA page range updates 1408326 11043064
NUMA huge PMD updates 0 21040
NUMA PTE updates 1408326 291624
"NUMA page range updates" == nr_pte_updates and is the value returned to
the NUMA pte scanner. NUMA huge PMD updates were the number of THP
updates which in combination can be used to calculate how many ptes were
updated from userspace.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reported-by: Alex Thorlton <athorlton@sgi.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Kamal Mostafa <kamal@canonical.com>
commit a4d62babf9 upstream.
Hardware:
CPU: XLP832,the 64-bit OS
NOR Flash:S29GL128S 128M
Software:
Kernel:2.6.32.41
Filesystem:JFFS2
When writing files, errors appear:
Write len 182 but return retlen 180
Write of 182 bytes at 0x072c815c failed. returned -5, retlen 180
Write len 186 but return retlen 184
Write of 186 bytes at 0x072caff4 failed. returned -5, retlen 184
These errors exist only in 64-bit systems,not in 32-bit systems. After analysis, we
found that the left shift operation is wrong in map_word_load_partial. For instance:
unsigned char buf[3] ={0x9e,0x3a,0xea};
map_bankwidth(map) is 4;
for (i=0; i < 3; i++) {
int bitpos;
bitpos = (map_bankwidth(map)-1-i)*8;
orig.x[0] &= ~(0xff << bitpos);
orig.x[0] |= buf[i] << bitpos;
}
The value of orig.x[0] is expected to be 0x9e3aeaff, but in this situation(64-bit
System) we'll get the wrong value of 0xffffffff9e3aeaff due to the 64-bit sign
extension:
buf[i] is defined as "unsigned char" and the left-shift operation will convert it
to the type of "signed int", so when left-shift buf[i] by 24 bits, the final result
will get the wrong value: 0xffffffff9e3aeaff.
If the left-shift bits are less than 24, then sign extension will not occur. Whereas
the bankwidth of the nor flash we used is 4, therefore this BUG emerges.
Signed-off-by: Pang Xunlei <pang.xunlei@zte.com.cn>
Signed-off-by: Zhang Yi <zhang.yi20@zte.com.cn>
Signed-off-by: Lu Zhongjun <lu.zhongjun@zte.com.cn>
Signed-off-by: Brian Norris <computersforpeace@gmail.com>
Signed-off-by: Kamal Mostafa <kamal@canonical.com>
[ Upstream commit 01ba16d6ec ]
On receiving a packet too big icmp error we update the expire value by
calling rt6_update_expires. This function uses dst_set_expires which is
implemented that it can only reduce the expiration value of the dst entry.
If we insert new routing non-expiry information into the ipv6 fib where
we already have a matching rt6_info we only clear the RTF_EXPIRES flag
in rt6i_flags and leave the dst.expires value as is.
When new mtu information arrives for that cached dst_entry we again
call dst_set_expires. This time it won't update the dst.expire value
because we left the dst.expire value intact from the last update. So
dst_set_expires won't touch dst.expires.
Fix this by resetting dst.expires when clearing the RTF_EXPIRE flag.
dst_set_expires checks for a zero expiration and updates the
dst.expires.
In the past this (not updating dst.expires) was necessary because
dst.expire was placed in a union with the dst_entry *from reference
and rt6_clean_expires did assign NULL to it. This split happend in
ecd9883724 ("ipv6: fix race condition
regarding dst->expires and dst->from").
Reported-by: Steinar H. Gunderson <sgunderson@bigfoot.com>
Reported-by: Valentijn Sessink <valentyn@blub.net>
Cc: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Acked-by: Eric Dumazet <edumazet@google.com>
Tested-by: Valentijn Sessink <valentyn@blub.net>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Kamal Mostafa <kamal@canonical.com>
commit 5017b28513 upstream.
dmi_match() considers a substring match to be a successful match. This is
not always sufficient to distinguish between DMI data for different
systems. Add support for exact string matching using strcmp() in addition
to the substring matching using strstr().
The specific use case in the i915 driver is to allow us to use an exact
match for D510MO, without also incorrectly matching D510MOV:
{
.ident = "Intel D510MO",
.matches = {
DMI_MATCH(DMI_BOARD_VENDOR, "Intel"),
DMI_EXACT_MATCH(DMI_BOARD_NAME, "D510MO"),
},
}
Signed-off-by: Jani Nikula <jani.nikula@intel.com>
Cc: <annndddrr@gmail.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Cornel Panceac <cpanceac@gmail.com>
Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[ kamal: 3.8-stable prereq for various commits ]
Signed-off-by: Kamal Mostafa <kamal@canonical.com>
commit bc5bd37ce4 upstream.
Pavel Roskin reported that DRM_IOCTL_MODE_GETCONNECTOR was overwritting
the 4 bytes beyond the end of its structure with a 32-bit userspace
running on a 64-bit kernel. This is due to the padding gcc inserts as
the drm_mode_get_connector struct includes a u64 and its size is not a
natural multiple of u64s.
64-bit kernel:
sizeof(drm_mode_get_connector)=80, alignof=8
sizeof(drm_mode_get_encoder)=20, alignof=4
sizeof(drm_mode_modeinfo)=68, alignof=4
32-bit userspace:
sizeof(drm_mode_get_connector)=76, alignof=4
sizeof(drm_mode_get_encoder)=20, alignof=4
sizeof(drm_mode_modeinfo)=68, alignof=4
Fortuituously we can insert explicit padding to the tail of our
structures without breaking ABI.
Reported-by: Pavel Roskin <proski@gnu.org>
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Dave Airlie <airlied@redhat.com>
Cc: dri-devel@lists.freedesktop.org
Signed-off-by: Dave Airlie <airlied@redhat.com>
Signed-off-by: Kamal Mostafa <kamal@canonical.com>
[ This is a simplified -stable version of a set of upstream commits. ]
This is a replacement patch only for stable which does fix the problems
handled by the following two commits in -net:
"ip_output: do skb ufo init for peeked non ufo skb as well" (e93b7d748b)
"ip6_output: do skb ufo init for peeked non ufo skb as well" (c547dbf55d)
Three frames are written on a corked udp socket for which the output
netdevice has UFO enabled. If the first and third frame are smaller than
the mtu and the second one is bigger, we enqueue the second frame with
skb_append_datato_frags without initializing the gso fields. This leads
to the third frame appended regulary and thus constructing an invalid skb.
This fixes the problem by always using skb_append_datato_frags as soon
as the first frag got enqueued to the skb without marking the packet
as SKB_GSO_UDP.
The problem with only two frames for ipv6 was fixed by "ipv6: udp
packets following an UFO enqueued packet need also be handled by UFO"
(2811ebac25).
Cc: Jiri Pirko <jiri@resnulli.us>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: David Miller <davem@davemloft.net>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Kamal Mostafa <kamal@canonical.com>
[ Upstream commit f2e5ddcc0d ]
When CONFIG_NETLABEL is disabled, the cipso_v4_validate() function could loop
forever in the main loop if opt[opt_iter +1] == 0, this will causing a kernel
crash in an SMP system, since the CPU executing this function will
stall /not respond to IPIs.
This problem can be reproduced by running the IP Stack Integrity Checker
(http://isic.sourceforge.net) using the following command on a Linux machine
connected to DUT:
"icmpsic -s rand -d <DUT IP address> -r 123456"
wait (1-2 min)
Signed-off-by: Seif Mazareeb <seif@marvell.com>
Acked-by: Paul Moore <paul@paul-moore.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Kamal Mostafa <kamal@canonical.com>
commit 14bbd6a565 upstream.
This function will be used in next GRE_GSO patch. This patch does
not change any functionality.
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Acked-by: Eric Dumazet <edumazet@google.com>
[ kamal: 3.8-stable prereq for c52e2421f7
"tcp: must unclone packets before mangling them" ]
Signed-off-by: Kamal Mostafa <kamal@canonical.com>
commit 95bd09eb27 upstream.
commit 02cf4ebd82 upstream.
commit 7eec4174ff upstream.
After hearing many people over past years complaining against TSO being
bursty or even buggy, we are proud to present automatic sizing of TSO
packets.
One part of the problem is that tcp_tso_should_defer() uses an heuristic
relying on upcoming ACKS instead of a timer, but more generally, having
big TSO packets makes little sense for low rates, as it tends to create
micro bursts on the network, and general consensus is to reduce the
buffering amount.
This patch introduces a per socket sk_pacing_rate, that approximates
the current sending rate, and allows us to size the TSO packets so
that we try to send one packet every ms.
This field could be set by other transports.
Patch has no impact for high speed flows, where having large TSO packets
makes sense to reach line rate.
For other flows, this helps better packet scheduling and ACK clocking.
This patch increases performance of TCP flows in lossy environments.
A new sysctl (tcp_min_tso_segs) is added, to specify the
minimal size of a TSO packet (default being 2).
A follow-up patch will provide a new packet scheduler (FQ), using
sk_pacing_rate as an input to perform optional per flow pacing.
This explains why we chose to set sk_pacing_rate to twice the current
rate, allowing 'slow start' ramp up.
sk_pacing_rate = 2 * cwnd * mss / srtt
v2: Neal Cardwell reported a suspect deferring of last two segments on
initial write of 10 MSS, I had to change tcp_tso_should_defer() to take
into account tp->xmit_size_goal_segs
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Van Jacobson <vanj@google.com>
Cc: Tom Herbert <therbert@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
[ kamal: backport to 3.8 (context) ]
Signed-off-by: Kamal Mostafa <kamal@canonical.com>
commit 32c37fc30c upstream.
Some USB drive enclosures do not correctly report an
overflow condition if they hold a drive with a capacity
over 2TB and are confronted with a READ_CAPACITY_10.
They answer with their capacity modulo 2TB.
The generic layer cannot cope with that. It must be told
to use READ_CAPACITY_16 from the beginning.
Signed-off-by: Oliver Neukum <oneukum@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Kamal Mostafa <kamal@canonical.com>
[ Upstream commit 703133de33 ]
If local fragmentation is allowed, then ip_select_ident() and
ip_select_ident_more() need to generate unique IDs to ensure
correct defragmentation on the peer.
For example, if IPsec (tunnel mode) has to encrypt large skbs
that have local_df bit set, then all IP fragments that belonged
to different ESP datagrams would have used the same identificator.
If one of these IP fragments would get lost or reordered, then
peer could possibly stitch together wrong IP fragments that did
not belong to the same datagram. This would lead to a packet loss
or data corruption.
Signed-off-by: Ansis Atteka <aatteka@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
[ kamal: backport to 3.8 (context) ]
Signed-off-by: Kamal Mostafa <kamal@canonical.com>
commit 3f3f8d2f48 upstream.
Throughout compiler*.h, many version checks are made. These can be
simplified by using the macro that gcc's documentation recommends.
However, my primary reason for adding this is that I need bug-check
macros that are enabled at certain gcc versions and it's cleaner to use
this macro than the tradition method:
#if __GNUC__ > 4 || (__GNUC__ == 4 && __GNUC_MINOR__ => 2)
If you add patch level, it gets this ugly:
#if __GNUC__ > 4 || (__GNUC__ == 4 && (__GNUC_MINOR__ > 2 || \
__GNUC_MINOR__ == 2 __GNUC_PATCHLEVEL__ >= 1))
As opposed to:
#if GCC_VERSION >= 40201
While having separate headers for gcc 3 & 4 eliminates some of this
verbosity, they can still be cleaned up by this.
See also:
http://gcc.gnu.org/onlinedocs/cpp/Common-Predefined-Macros.html
Signed-off-by: Daniel Santos <daniel.santos@pobox.com>
Acked-by: Borislav Petkov <bp@alien8.de>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Joe Perches <joe@perches.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[ kamal: 3.8-stable prereq for
3f0116c "compiler/gcc4: Add quirk for 'asm goto' miscompilation bug" ]
Signed-off-by: Kamal Mostafa <kamal@canonical.com>
commit 61875f30da upstream.
Allow architectures which have a disabled get_cycles() function to
provide a random_get_entropy() function which provides a fine-grained,
rapidly changing counter that can be used by the /dev/random driver.
For example, an architecture might have a rapidly changing register
used to control random TLB cache eviction, or DRAM refresh that
doesn't meet the requirements of get_cycles(), but which is good
enough for the needs of the random driver.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Kamal Mostafa <kamal@canonical.com>
commit 47d06e532e upstream.
The some platforms (e.g., ARM) initializes their clocks as
late_initcalls for some unknown reason. So make sure
random_int_secret_init() is run after all of the late_initcalls are
run.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
[ kamal: backport to 3.8 (context) ]
Signed-off-by: Kamal Mostafa <kamal@canonical.com>
commit 0c12582fbc upstream.
Dmitry Akindinov is reporting for a problem where SYNs are looping
between the master and backup server when the backup server is used as
real server in DR mode and has IPVS rules to function as director.
Even when the backup function is enabled we continue to forward
traffic and schedule new connections when the current master is using
the backup server as real server. While this is not a problem for NAT,
for DR and TUN method the backup server can not determine if a request
comes from client or from director.
To avoid such loops add new sysctl flag backup_only. It can be needed
for DR/TUN setups that do not need backup and director function at the
same time. When the backup function is enabled we stop any forwarding
and pass the traffic to the local stack (real server mode). The flag
disables the director function when the backup function is enabled.
For setups that enable backup function for some virtual services and
director function for other virtual services there should be another
more complex solution to support DR/TUN mode, may be to assign
per-virtual service syncid value, so that we can differentiate the
requests.
Reported-by: Dmitry Akindinov <dimak@stalker.com>
Tested-by: German Myzovsky <lawyer@sipnet.ru>
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
Cc: Luis Henriques <luis.henriques@canonical.com>
Signed-off-by: Kamal Mostafa <kamal@canonical.com>
commit 5e130367d4 upstream.
This makes it more convenient to check for rfkill (no need to check for
dev->rfkill before calling rfkill_blocked()) and also avoids potential
races if the RFKILL state needs to be checked from within the rfkill
callback.
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Acked-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Gustavo Padovan <gustavo.padovan@collabora.co.uk>
Signed-off-by: Kamal Mostafa <kamal@canonical.com>
commit 331415ff16 upstream.
Many drivers need to validate the characteristics of their HID report
during initialization to avoid misusing the reports. This adds a common
helper to perform validation of the report exisitng, the field existing,
and the expected number of values within the field.
Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Benjamin Tissoires <benjamin.tissoires@redhat.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Kamal Mostafa <kamal@canonical.com>
commit e5b9e7503e upstream.
Also add a new RADEON_INFO query to check that CP DMA packets are
supported on the compute ring.
CP DMA has been supported since the 3.8 kernel, but due to an oversight
we forgot to teach the CS checker that the CP DMA packet was legal for
the compute ring on Southern Islands GPUs.
This patch fixes a bug where the radeon driver will incorrectly reject a legal
CP DMA packet from user space. I would like to have the patch
backported to stable so that we don't have to require Mesa users to use a
bleeding edge kernel in order to take advantage of this feature which
is already present in the stable kernels (3.8 and newer).
v2:
- Don't bump kms version, so this patch can be backported to stable
kernels.
Signed-off-by: Tom Stellard <thomas.stellard@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
[ kamal: backport to 3.8 ]
Signed-off-by: Kamal Mostafa <kamal@canonical.com>
[ Upstream commit 61e76b178d ]
RFC 4443 has defined two additional codes for ICMPv6 type 1 (destination
unreachable) messages:
5 - Source address failed ingress/egress policy
6 - Reject route to destination
Now they are treated as protocol error and icmpv6_err_convert() converts them
to EPROTO.
RFC 4443 says:
"Codes 5 and 6 are more informative subsets of code 1."
Treat codes 5 and 6 as code 1 (EACCES)
Btw, connect() returning -EPROTO confuses firefox, so that fallback to
other/IPv4 addresses does not work:
https://bugzilla.mozilla.org/show_bug.cgi?id=910773
Signed-off-by: Jiri Bohac <jbohac@suse.cz>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Kamal Mostafa <kamal@canonical.com>
[ Upstream commit f46078cfcd ]
It is not allowed for an ipv6 packet to contain multiple fragmentation
headers. So discard packets which were already reassembled by
fragmentation logic and send back a parameter problem icmp.
The updates for RFC 6980 will come in later, I have to do a bit more
research here.
Cc: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Kamal Mostafa <kamal@canonical.com>
commit dd3332bfcb upstream.
Router Alert option is very small and we can store the value
itself in the skb.
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
[ kamal: 3.8.y-stable prereq for f46078cfcd
"ipv6: drop packets with multiple fragmentation headers" ]
Signed-off-by: Kamal Mostafa <kamal@canonical.com>
commit a19dec6ea9 upstream.
This patch fixes following error:
include/media/v4l2-ctrls.h:193:15: error: field ‘_lock’ has incomplete type
include/media/v4l2-ctrls.h: In function ‘v4l2_ctrl_lock’:
include/media/v4l2-ctrls.h:570:2: error: implicit declaration of
function ‘mutex_lock’ [-Werror=implicit-function-declaration]
include/media/v4l2-ctrls.h: In function ‘v4l2_ctrl_unlock’:
include/media/v4l2-ctrls.h:579:2: error: implicit declaration of
function ‘mutex_unlock’ [-Werror=implicit-function-declaration]
Signed-off-by: Andrzej Hajda <a.hajda@samsung.com>
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
Signed-off-by: Hans Verkuil <hans.verkuil@cisco.com>
Signed-off-by: Mauro Carvalho Chehab <m.chehab@samsung.com>
Signed-off-by: Kamal Mostafa <kamal@canonical.com>