[ Upstream commit a5b1379afb ]
Fix initialize the uninitialized tx_qlen to an appropriate value when USB
Full Speed is used.
Fixes: 55d7de9de6 ("Microchip's LAN7800 family USB 2/3 to 10/100/1000 Ethernet device driver")
Signed-off-by: Yuiko Oshino <yuiko.oshino@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 95ef498d97 ]
In my last patch, I missed fact that cork.base.dst was not initialized
in ip6_make_skb() :
If ip6_setup_cork() returns an error, we might attempt a dst_release()
on some random pointer.
Fixes: 862c03ee1d ("ipv6: fix possible mem leaks in ipv6_make_skb()")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit e9191ffb65 ]
Commit 513674b5a2 ("net: reevalulate autoflowlabel setting after
sysctl setting") removed the initialisation of
ipv6_pinfo::autoflowlabel and added a second flag to indicate
whether this field or the net namespace default should be used.
The getsockopt() handling for this case was not updated, so it
currently returns 0 for all sockets for which IPV6_AUTOFLOWLABEL is
not explicitly enabled. Fix it to return the effective value, whether
that has been set at the socket or net namespace level.
Fixes: 513674b5a2 ("net: reevalulate autoflowlabel setting after sysctl ...")
Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit dd5684ecae ]
ccid2_hc_tx_rto_expire() timer callback always restarts the timer
again and can run indefinitely (unless it is stopped outside), and after
commit 120e9dabaf ("dccp: defer ccid_hc_tx_delete() at dismantle time"),
which moved ccid_hc_tx_delete() (also includes sk_stop_timer()) from
dccp_destroy_sock() to sk_destruct(), this started to happen quite often.
The timer prevents releasing the socket, as a result, sk_destruct() won't
be called.
Found with LTP/dccp_ipsec tests running on the bonding device,
which later couldn't be unloaded after the tests were completed:
unregister_netdevice: waiting for bond0 to become free. Usage count = 148
Fixes: 2a91aa3967 ("[DCCP] CCID2: Initial CCID2 (TCP-Like) implementation")
Signed-off-by: Alexey Kodanev <alexey.kodanev@oracle.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 7e04072685 upstream.
[resend due to me forgetting to cc: linux-api the first time around I
posted these back on Feb 23]
From: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
For some reason these values are not in the uapi header file, so any
libc has to define it themselves. To prevent them from needing to do
this, just have the kernel provide the correct values.
Reported-by: Elliott Hughes <enh@google.com>
Signed-off-by: Greg Hackmann <ghackmann@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The backport of KAISER to 4.4 turned vsyscall emulate mode into native
mode. Add a vsyscall_pgprot variable to hold the correct page
protections, like Borislav and Hugh did for 3.2 and 3.18.
Cc: Borislav Petkov <bp@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 883354afbc upstream.
Debian's gcc defaults to pie. The global Makefile already defines the -fno-pie option.
Link UML dynamic kernel image also with -no-pie to fix the build.
Signed-off-by: Thomas Meyer <thomas@m3y3r.de>
Signed-off-by: Richard Weinberger <richard@nod.at>
Cc: Bernie Innocenti <codewiz@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit f5d07b9e98 upstream.
Lenovo introduced trackpoint compatible sticks with minimum PS/2 commands.
They supposed to reply with 0x02, 0x03, or 0x04 in response to the
"Read Extended ID" command, so we would know not to try certain extended
commands. Unfortunately even some trackpoints reporting the original IBM
version (0x01 firmware 0x0e) now respond with incorrect data to the "Get
Extended Buttons" command:
thinkpad_acpi: ThinkPad BIOS R0DET87W (1.87 ), EC unknown
thinkpad_acpi: Lenovo ThinkPad E470, model 20H1004SGE
psmouse serio2: trackpoint: IBM TrackPoint firmware: 0x0e, buttons: 0/0
Since there are no trackpoints without buttons, let's assume the trackpoint
has 3 buttons when we get 0 response to the extended buttons query.
Signed-off-by: Aaron Ma <aaron.ma@canonical.com>
Fixes: https://bugzilla.kernel.org/show_bug.cgi?id=196253
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit c73322d098 upstream.
Patch series "mm: kswapd spinning on unreclaimable nodes - fixes and
cleanups".
Jia reported a scenario in which the kswapd of a node indefinitely spins
at 100% CPU usage. We have seen similar cases at Facebook.
The kernel's current method of judging its ability to reclaim a node (or
whether to back off and sleep) is based on the amount of scanned pages
in proportion to the amount of reclaimable pages. In Jia's and our
scenarios, there are no reclaimable pages in the node, however, and the
condition for backing off is never met. Kswapd busyloops in an attempt
to restore the watermarks while having nothing to work with.
This series reworks the definition of an unreclaimable node based not on
scanning but on whether kswapd is able to actually reclaim pages in
MAX_RECLAIM_RETRIES (16) consecutive runs. This is the same criteria
the page allocator uses for giving up on direct reclaim and invoking the
OOM killer. If it cannot free any pages, kswapd will go to sleep and
leave further attempts to direct reclaim invocations, which will either
make progress and re-enable kswapd, or invoke the OOM killer.
Patch #1 fixes the immediate problem Jia reported, the remainder are
smaller fixlets, cleanups, and overall phasing out of the old method.
Patch #6 is the odd one out. It's a nice cleanup to get_scan_count(),
and directly related to #5, but in itself not relevant to the series.
If the whole series is too ambitious for 4.11, I would consider the
first three patches fixes, the rest cleanups.
This patch (of 9):
Jia He reports a problem with kswapd spinning at 100% CPU when
requesting more hugepages than memory available in the system:
$ echo 4000 >/proc/sys/vm/nr_hugepages
top - 13:42:59 up 3:37, 1 user, load average: 1.09, 1.03, 1.01
Tasks: 1 total, 1 running, 0 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.0 us, 12.5 sy, 0.0 ni, 85.5 id, 2.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem: 31371520 total, 30915136 used, 456384 free, 320 buffers
KiB Swap: 6284224 total, 115712 used, 6168512 free. 48192 cached Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
76 root 20 0 0 0 0 R 100.0 0.000 217:17.29 kswapd3
At that time, there are no reclaimable pages left in the node, but as
kswapd fails to restore the high watermarks it refuses to go to sleep.
Kswapd needs to back away from nodes that fail to balance. Up until
commit 1d82de618d ("mm, vmscan: make kswapd reclaim in terms of
nodes") kswapd had such a mechanism. It considered zones whose
theoretically reclaimable pages it had reclaimed six times over as
unreclaimable and backed away from them. This guard was erroneously
removed as the patch changed the definition of a balanced node.
However, simply restoring this code wouldn't help in the case reported
here: there *are* no reclaimable pages that could be scanned until the
threshold is met. Kswapd would stay awake anyway.
Introduce a new and much simpler way of backing off. If kswapd runs
through MAX_RECLAIM_RETRIES (16) cycles without reclaiming a single
page, make it back off from the node. This is the same number of shots
direct reclaim takes before declaring OOM. Kswapd will go to sleep on
that node until a direct reclaimer manages to reclaim some pages, thus
proving the node reclaimable again.
[hannes@cmpxchg.org: check kswapd failure against the cumulative nr_reclaimed count]
Link: http://lkml.kernel.org/r/20170306162410.GB2090@cmpxchg.org
[shakeelb@google.com: fix condition for throttle_direct_reclaim]
Link: http://lkml.kernel.org/r/20170314183228.20152-1-shakeelb@google.com
Link: http://lkml.kernel.org/r/20170228214007.5621-2-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Reported-by: Jia He <hejianet@gmail.com>
Tested-by: Jia He <hejianet@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Dmitry Shmidt <dimitrysh@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit fc3dc67471 upstream.
fcntl(0, F_SETOWN, 0x80000000) triggers:
UBSAN: Undefined behaviour in fs/fcntl.c:118:7
negation of -2147483648 cannot be represented in type 'int':
CPU: 1 PID: 18261 Comm: syz-executor Not tainted 4.8.1-0-syzkaller #1
...
Call Trace:
...
[<ffffffffad8f0868>] ? f_setown+0x1d8/0x200
[<ffffffffad8f19a9>] ? SyS_fcntl+0x999/0xf30
[<ffffffffaed1fb00>] ? entry_SYSCALL_64_fastpath+0x23/0xc1
Fix that by checking the arg parameter properly (against INT_MAX) before
"who = -who". And return immediatelly with -EINVAL in case it is wrong.
Note that according to POSIX we can return EINVAL:
http://pubs.opengroup.org/onlinepubs/9699919799/functions/fcntl.html
[EINVAL]
The cmd argument is F_SETOWN and the value of the argument
is not valid as a process or process group identifier.
[v2] returns an error, v1 used to fail silently
[v3] implement proper check for the bad value INT_MIN
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Cc: Jeff Layton <jlayton@poochiereds.net>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 54930dfeb4 upstream.
Most extended attributes will fit in a single block. More importantly,
we drop the reference to the inode while holding the transaction open
so the preallocated blocks aren't released. As a result, the inode
may be evicted before it's removed from the transaction's prealloc list
which can cause memory corruption.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 08db141b53 upstream.
The main loop in __discard_prealloc is protected by the reiserfs write lock
which is dropped across schedules like the BKL it replaced. The problem is
that it checks the value, calls a routine that schedules, and then adjusts
the state. As a result, two threads that are calling
reiserfs_prealloc_discard at the same time can race when one calls
reiserfs_free_prealloc_block, the lock is dropped, and the other calls
reiserfs_free_prealloc_block with the same block number. In the right
circumstances, it can cause the prealloc count to go negative.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 916a27901d upstream.
The capability check in nfnetlink_rcv() verifies that the caller
has CAP_NET_ADMIN in the namespace that "owns" the netlink socket.
However, xt_osf_fingers is shared by all net namespaces on the
system. An unprivileged user can create user and net namespaces
in which he holds CAP_NET_ADMIN to bypass the netlink_net_capable()
check:
vpnns -- nfnl_osf -f /tmp/pf.os
vpnns -- nfnl_osf -f /tmp/pf.os -d
These non-root operations successfully modify the systemwide OS
fingerprint list. Add new capable() checks so that they can't.
Signed-off-by: Kevin Cernekee <cernekee@chromium.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Acked-by: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 4b380c42f7 upstream.
The capability check in nfnetlink_rcv() verifies that the caller
has CAP_NET_ADMIN in the namespace that "owns" the netlink socket.
However, nfnl_cthelper_list is shared by all net namespaces on the
system. An unprivileged user can create user and net namespaces
in which he holds CAP_NET_ADMIN to bypass the netlink_net_capable()
check:
$ nfct helper list
nfct v1.4.4: netlink error: Operation not permitted
$ vpnns -- nfct helper list
{
.name = ftp,
.queuenum = 0,
.l3protonum = 2,
.l4protonum = 6,
.priv_data_len = 24,
.status = enabled,
};
Add capable() checks in nfnetlink_cthelper, as this is cleaner than
trying to generalize the solution.
Signed-off-by: Kevin Cernekee <cernekee@chromium.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Acked-by: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 3b2d69114f upstream.
ACPICA commit a23325b2e583556eae88ed3f764e457786bf4df6
I found some ACPI operand cache leaks in ACPI early abort cases.
Boot log of ACPI operand cache leak is as follows:
>[ 0.174332] ACPI: Added _OSI(Module Device)
>[ 0.175504] ACPI: Added _OSI(Processor Device)
>[ 0.176010] ACPI: Added _OSI(3.0 _SCP Extensions)
>[ 0.177032] ACPI: Added _OSI(Processor Aggregator Device)
>[ 0.178284] ACPI: SCI (IRQ16705) allocation failed
>[ 0.179352] ACPI Exception: AE_NOT_ACQUIRED, Unable to install
System Control Interrupt handler (20160930/evevent-131)
>[ 0.180008] ACPI: Unable to start the ACPI Interpreter
>[ 0.181125] ACPI Error: Could not remove SCI handler
(20160930/evmisc-281)
>[ 0.184068] kmem_cache_destroy Acpi-Operand: Slab cache still has
objects
>[ 0.185358] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.10.0-rc3 #2
>[ 0.186820] Hardware name: innotek gmb_h virtual_box/virtual_box, BIOS
virtual_box 12/01/2006
>[ 0.188000] Call Trace:
>[ 0.188000] ? dump_stack+0x5c/0x7d
>[ 0.188000] ? kmem_cache_destroy+0x224/0x230
>[ 0.188000] ? acpi_sleep_proc_init+0x22/0x22
>[ 0.188000] ? acpi_os_delete_cache+0xa/0xd
>[ 0.188000] ? acpi_ut_delete_caches+0x3f/0x7b
>[ 0.188000] ? acpi_terminate+0x5/0xf
>[ 0.188000] ? acpi_init+0x288/0x32e
>[ 0.188000] ? __class_create+0x4c/0x80
>[ 0.188000] ? video_setup+0x7a/0x7a
>[ 0.188000] ? do_one_initcall+0x4e/0x1b0
>[ 0.188000] ? kernel_init_freeable+0x194/0x21a
>[ 0.188000] ? rest_init+0x80/0x80
>[ 0.188000] ? kernel_init+0xa/0x100
>[ 0.188000] ? ret_from_fork+0x25/0x30
When early abort is occurred due to invalid ACPI information, Linux kernel
terminates ACPI by calling acpi_terminate() function. The function calls
acpi_ns_terminate() function to delete namespace data and ACPI operand cache
(acpi_gbl_module_code_list).
But the deletion code in acpi_ns_terminate() function is wrapped in
ACPI_EXEC_APP definition, therefore the code is only executed when the
definition exists. If the define doesn't exist, ACPI operand cache
(acpi_gbl_module_code_list) is leaked, and stack dump is shown in kernel log.
This causes a security threat because the old kernel (<= 4.9) shows memory
locations of kernel functions in stack dump, therefore kernel ASLR can be
neutralized.
To fix ACPI operand leak for enhancing security, I made a patch which
removes the ACPI_EXEC_APP define in acpi_ns_terminate() function for
executing the deletion code unconditionally.
Link: https://github.com/acpica/acpica/commit/a23325b2
Signed-off-by: Seunghun Han <kkamagui@gmail.com>
Signed-off-by: Lv Zheng <lv.zheng@intel.com>
Signed-off-by: Bob Moore <robert.moore@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Lee, Chun-Yi <jlee@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit c2a6bbaf0c upstream.
The way acpi_find_child_device() works currently is that, if there
are two (or more) devices with the same _ADR value in the same
namespace scope (which is not specifically allowed by the spec and
the OS behavior in that case is not defined), the first one of them
found to be present (with the help of _STA) will be returned.
This covers the majority of cases, but is not sufficient if some of
the devices in question have a _HID (or _CID) returning some valid
ACPI/PNP device IDs (which is disallowed by the spec) and the
ASL writers' expectation appears to be that the OS will match
devices without a valid ACPI/PNP device ID against a given bus
address first.
To cover this special case as well, modify find_child_checks()
to prefer devices without ACPI/PNP device IDs over devices that
have them.
Suggested-by: Mika Westerberg <mika.westerberg@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Tested-by: Hans de Goede <hdegoede@redhat.com>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 999898355e upstream.
When LONG_MIN is passed to msgrcv, one would expect to recieve any
message. But convert_mode does *msgtyp = -*msgtyp and -LONG_MIN is
undefined. In particular, with my gcc -LONG_MIN produces -LONG_MIN
again.
So handle this case properly by assigning LONG_MAX to *msgtyp if
LONG_MIN was specified as msgtyp to msgrcv.
This code:
long msg[] = { 100, 200 };
int m = msgget(IPC_PRIVATE, IPC_CREAT | 0644);
msgsnd(m, &msg, sizeof(msg), 0);
msgrcv(m, &msg, sizeof(msg), LONG_MIN, 0);
produces currently nothing:
msgget(IPC_PRIVATE, IPC_CREAT|0644) = 65538
msgsnd(65538, {100, "\310\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"}, 16, 0) = 0
msgrcv(65538, ...
Except a UBSAN warning:
UBSAN: Undefined behaviour in ipc/msg.c:745:13
negation of -9223372036854775808 cannot be represented in type 'long int':
With the patch, I see what I expect:
msgget(IPC_PRIVATE, IPC_CREAT|0644) = 0
msgsnd(0, {100, "\310\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"}, 16, 0) = 0
msgrcv(0, {100, "\310\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"}, 16, -9223372036854775808, 0) = 16
Link: http://lkml.kernel.org/r/20161024082633.10148-1-jslaby@suse.cz
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit b050e3769c upstream.
Since commit 97a16fc82a ("mm, page_alloc: only enforce watermarks for
order-0 allocations"), __zone_watermark_ok() check for high-order
allocations will shortcut per-migratetype free list checks for
ALLOC_HARDER allocations, and return true as long as there's free page
of any migratetype. The intention is that ALLOC_HARDER can allocate
from MIGRATE_HIGHATOMIC free lists, while normal allocations can't.
However, as a side effect, the watermark check will then also return
true when there are pages only on the MIGRATE_ISOLATE list, or (prior to
CMA conversion to ZONE_MOVABLE) on the MIGRATE_CMA list. Since the
allocation cannot actually obtain isolated pages, and might not be able
to obtain CMA pages, this can result in a false positive.
The condition should be rare and perhaps the outcome is not a fatal one.
Still, it's better if the watermark check is correct. There also
shouldn't be a performance tradeoff here.
Link: http://lkml.kernel.org/r/20171102125001.23708-1-vbabka@suse.cz
Fixes: 97a16fc82a ("mm, page_alloc: only enforce watermarks for order-0 allocations")
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit e048cb32f6 upstream.
The align_offset parameter is used by bitmap_find_next_zero_area_off()
to represent the offset of map's base from the previous alignment
boundary; the function ensures that the returned index, plus the
align_offset, honors the specified align_mask.
The logic introduced by commit b5be83e308 ("mm: cma: align to physical
address, not CMA region position") has the cma driver calculate the
offset to the *next* alignment boundary. In most cases, the base
alignment is greater than that specified when making allocations,
resulting in a zero offset whether we align up or down. In the example
given with the commit, the base alignment (8MB) was half the requested
alignment (16MB) so the math also happened to work since the offset is
8MB in both directions. However, when requesting allocations with an
alignment greater than twice that of the base, the returned index would
not be correctly aligned.
Also, the align_order arguments of cma_bitmap_aligned_mask() and
cma_bitmap_aligned_offset() should not be negative so the argument type
was made unsigned.
Fixes: b5be83e308 ("mm: cma: align to physical address, not CMA region position")
Link: http://lkml.kernel.org/r/20170628170742.2895-1-opendmb@gmail.com
Signed-off-by: Angus Clark <angus@angusclark.org>
Signed-off-by: Doug Berger <opendmb@gmail.com>
Acked-by: Gregory Fong <gregory.0xf0@gmail.com>
Cc: Doug Berger <opendmb@gmail.com>
Cc: Angus Clark <angus@angusclark.org>
Cc: Laura Abbott <labbott@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Lucas Stach <l.stach@pengutronix.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Shiraz Hashim <shashim@codeaurora.org>
Cc: Jaewon Kim <jaewon31.kim@samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 18365225f0 upstream.
Laurent Dufour has noticed that hwpoinsoned pages are kept charged. In
his particular case he has hit a bad_page("page still charged to
cgroup") when onlining a hwpoison page. While this looks like something
that shouldn't happen in the first place because onlining hwpages and
returning them to the page allocator makes only little sense it shows a
real problem.
hwpoison pages do not get freed usually so we do not uncharge them (at
least not since commit 0a31bc97c8 ("mm: memcontrol: rewrite uncharge
API")). Each charge pins memcg (since e8ea14cc6e ("mm: memcontrol:
take a css reference for each charged page")) as well and so the
mem_cgroup and the associated state will never go away. Fix this leak
by forcibly uncharging a LRU hwpoisoned page in delete_from_lru_cache().
We also have to tweak uncharge_list because it cannot rely on zero ref
count for these pages.
[akpm@linux-foundation.org: coding-style fixes]
Fixes: 0a31bc97c8 ("mm: memcontrol: rewrite uncharge API")
Link: http://lkml.kernel.org/r/20170502185507.GB19165@dhcp22.suse.cz
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reported-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
Tested-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
Reviewed-by: Balbir Singh <bsingharora@gmail.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 561b5e0709 upstream.
Commit 1be7107fbe ("mm: larger stack guard gap, between vmas") has
introduced a regression in some rust and Java environments which are
trying to implement their own stack guard page. They are punching a new
MAP_FIXED mapping inside the existing stack Vma.
This will confuse expand_{downwards,upwards} into thinking that the
stack expansion would in fact get us too close to an existing non-stack
vma which is a correct behavior wrt safety. It is a real regression on
the other hand.
Let's work around the problem by considering PROT_NONE mapping as a part
of the stack. This is a gros hack but overflowing to such a mapping
would trap anyway an we only can hope that usespace knows what it is
doing and handle it propely.
Fixes: 1be7107fbe ("mm: larger stack guard gap, between vmas")
Link: http://lkml.kernel.org/r/20170705182849.GA18027@dhcp22.suse.cz
Signed-off-by: Michal Hocko <mhocko@suse.com>
Debugged-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Ben Hutchings <ben@decadent.org.uk>
Cc: Willy Tarreau <w@1wt.eu>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 55877ef45f upstream.
ARM64 enables both CONFIG_OF and CONFIG_ACPI and the firmware can pass
both ACPI tables and the device tree. Based on the kernel parameter, one
of the two will be chosen. If acpi is enabled, then device tree is not
unflattened.
Currently ARM64 platforms report:
"
Failed to find cpu0 device node
Unable to detect cache hierarchy from DT for CPU 0
"
which is incorrect when booting with ACPI. Also latest ACPI v6.1 has no
support for cache properties/hierarchy.
This patch adds check for unflattened device tree and also returns as
"not supported" if ACPI is runtime enabled.
It also removes the reference to DT from the error message as the cache
hierarchy can be detected from the firmware(OF/DT/ACPI)
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Sudeep Holla <sudeep.holla@arm.com>
Signed-off-by: Mian Yousaf Kaukab <yousaf.kaukab@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit fac5148257 upstream.
With CONFIG_OF enabled on x86, we get the following error on boot:
"
Failed to find cpu0 device node
Unable to detect cache hierarchy from DT for CPU 0
"
and the cacheinfo fails to get populated in the corresponding sysfs
entries. This is because cache_setup_of_node looks for of_node for
setting up the shared cpu_map without checking that it's already
populated in the architecture specific callback.
In order to indicate that the shared cpu_map is already populated, this
patch introduces a boolean `cpu_map_populated` in struct cpu_cacheinfo
that can be used by the generic code to skip cache_shared_cpu_map_setup.
This patch also sets that boolean for x86.
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Sudeep Holla <sudeep.holla@arm.com>
Signed-off-by: Mian Yousaf Kaukab <yousaf.kaukab@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 88d879d29f upstream.
Newer hardware has uncovered a bug in the software implementation of
using MWAITX for the delay function. A value of 0 for the timer is meant
to indicate that a timeout will not be used to exit MWAITX. On newer
hardware this can result in MWAITX never returning, resulting in NMI
soft lockup messages being printed. On older hardware, some of the other
conditions under which MWAITX can exit masked this issue. The AMD APM
does not currently document this and will be updated.
Please refer to http://marc.info/?l=kvm&m=148950623231140 for
information regarding NMI soft lockup messages on an AMD Ryzen 1800X.
This has been root-caused as a 0 passed to MWAITX causing it to wait
indefinitely.
This change has the added benefit of avoiding the unnecessary setup of
MONITORX/MWAITX when the delay value is zero.
Signed-off-by: Janakarajan Natarajan <Janakarajan.Natarajan@amd.com>
Link: http://lkml.kernel.org/r/1493156643-29366-1-git-send-email-Janakarajan.Natarajan@amd.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit c507babf10 upstream.
KVM only supports PMD hugepages at stage 2 but doesn't actually check
that the provided hugepage memory pagesize is PMD_SIZE before populating
stage 2 entries.
In cases where the backing hugepage size is smaller than PMD_SIZE (such
as when using contiguous hugepages), KVM can end up creating stage 2
mappings that extend beyond the supplied memory.
Fix this by checking for the pagesize of userspace vma before creating
PMD hugepage at stage 2.
Fixes: 66b3923a1a ("arm64: hugetlb: add support for PTE contiguous bit")
Signed-off-by: Punit Agrawal <punit.agrawal@arm.com>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit e5dfa3f902 upstream.
The usbip userspace tools call sprintf()/snprintf() and don't check for
the return value which can lead the paths to overflow, truncating the
final file in the path.
More urgently, GCC 7 now warns that these aren't checked with
-Wformat-overflow, and with -Werror enabled in configure.ac, that makes
these tools unbuildable.
This patch fixes these problems by replacing sprintf() with snprintf() in
one place and adding checks for the return value of snprintf().
Reviewed-by: Peter Senna Tschudin <peter.senna@gmail.com>
Signed-off-by: Jonathan Dieter <jdieter@lesbg.com>
Acked-by: Shuah Khan <shuahkh@osg.samsung.com>
Signed-off-by: Shuah Khan <shuahkh@osg.samsung.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit cfd6ed4537 upstream.
GCC 7 now warns when switch statements fall through implicitly, and with
-Werror enabled in configure.ac, that makes these tools unbuildable.
We fix this by notifying the compiler that this particular case statement
is meant to fall through.
Reviewed-by: Peter Senna Tschudin <peter.senna@gmail.com>
Signed-off-by: Jonathan Dieter <jdieter@lesbg.com>
Signed-off-by: Shuah Khan <shuahkh@osg.samsung.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 2f2d0088eb upstream.
When a client has a USB device attached over IP, the vhci_hcd driver is
locally leaking a socket pointer address via the
/sys/devices/platform/vhci_hcd/status file (world-readable) and in debug
output when "usbip --debug port" is run.
Fix it to not leak. The socket pointer address is not used at the moment
and it was made visible as a convenient way to find IP address from socket
pointer address by looking up /proc/net/{tcp,tcp6}.
As this opens a security hole, the fix replaces socket pointer address with
sockfd.
Reported-by: Secunia Research <vuln@secunia.com>
Signed-off-by: Shuah Khan <shuahkh@osg.samsung.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit a0ec1ded22 upstream.
In orangefs_devreq_read, there is a loop which picks an op off the list
of pending ops. If the loop fails to find an op, there is nothing to
read, and it returns EAGAIN. If the op has been given up on, the loop
is restarted via a goto. The bug is that the variable which the found
op is written to is not reinitialized, so if there are no more eligible
ops on the list, the code runs again on the already handled op.
This is triggered by interrupting a process while the op is being copied
to the client-core. It's a fairly small window, but it's there.
Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 0a5191efe0 upstream.
Since commit aef9a7bd9b ("serial/uart/8250: Add tunable RX interrupt
trigger I/F of FIFO buffers"), the port's default FCR value isn't used
in serial8250_do_set_termios anymore, but copied over once in
serial8250_config_port and then modified as needed.
Unfortunately, serial8250_config_port will never be called if the port
is shared between kernel and userspace, and the port's flag doesn't have
UPF_BOOT_AUTOCONF, which would trigger a serial8250_config_port as well.
This causes garbled output from userspace:
[ 5.220000] random: procd urandom read with 49 bits of entropy available
ers
[kee
Fix this by forcing it to be configured on boot, resulting in the
expected output:
[ 5.250000] random: procd urandom read with 50 bits of entropy available
Press the [f] key and hit [enter] to enter failsafe mode
Press the [1], [2], [3] or [4] key and hit [enter] to select the debug level
Fixes: aef9a7bd9b ("serial/uart/8250: Add tunable RX interrupt trigger I/F of FIFO buffers")
Signed-off-by: Jonas Gorski <jonas.gorski@gmail.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Yoshihiro YUNOMAE <yoshihiro.yunomae.ez@hitachi.com>
Cc: Florian Fainelli <f.fainelli@gmail.com>
Cc: Nicolas Schichan <nschichan@freebox.fr>
Cc: linux-mips@linux-mips.org
Cc: linux-serial@vger.kernel.org
Patchwork: https://patchwork.linux-mips.org/patch/17544/
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
Cc: James Hogan <jhogan@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit acfb3b883f upstream.
KVM doesn't follow the SMCCC when it comes to unimplemented calls,
and inject an UNDEF instead of returning an error. Since firmware
calls are now used for security mitigation, they are becoming more
common, and the undef is counter productive.
Instead, let's follow the SMCCC which states that -1 must be returned
to the caller when getting an unknown function number.
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 490ae017f5 upstream.
For btree removal, there is a corner case that a single thread
could takes 6 locks which is more than THIN_MAX_CONCURRENT_LOCKS(5)
and leads to deadlock.
A btree removal might eventually call
rebalance_children()->rebalance3() to rebalance entries of three
neighbor child nodes when shadow_spine has already acquired two
write locks. In rebalance3(), it tries to shadow and acquire the
write locks of all three child nodes. However, shadowing a child
node requires acquiring a read lock of the original child node and
a write lock of the new block. Although the read lock will be
released after block shadowing, shadowing the third child node
in rebalance3() could still take the sixth lock.
(2 write locks for shadow_spine +
2 write locks for the first two child nodes's shadow +
1 write lock for the last child node's shadow +
1 read lock for the last child node)
Signed-off-by: Dennis Yang <dennisyang@qnap.com>
Acked-by: Joe Thornber <thornber@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit bc68d0a435 upstream.
When inserting a new key/value pair into a btree we walk down the spine of
btree nodes performing the following 2 operations:
i) space for a new entry
ii) adjusting the first key entry if the new key is lower than any in the node.
If the _root_ node is full, the function btree_split_beneath() allocates 2 new
nodes, and redistibutes the root nodes entries between them. The root node is
left with 2 entries corresponding to the 2 new nodes.
btree_split_beneath() then adjusts the spine to point to one of the two new
children. This means the first key is never adjusted if the new key was lower,
ie. operation (ii) gets missed out. This can result in the new key being
'lost' for a period; until another low valued key is inserted that will uncover
it.
This is a serious bug, and quite hard to make trigger in normal use. A
reproducing test case ("thin create devices-in-reverse-order") is
available as part of the thin-provision-tools project:
https://github.com/jthornber/thin-provisioning-tools/blob/master/functional-tests/device-mapper/dm-tests.scm#L593
Fix the issue by changing btree_split_beneath() so it no longer adjusts
the spine. Instead it unlocks both the new nodes, and lets the main
loop in btree_insert_raw() relock the appropriate one and make any
neccessary adjustments.
Reported-by: Monty Pavel <monty_pavel@sina.com>
Signed-off-by: Joe Thornber <thornber@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>