Commit Graph

64639 Commits

Author SHA1 Message Date
Tejun Heo
28a4e95363 workqueue: fix hang involving racing cancel[_delayed]_work_sync()'s for PREEMPT_NONE
commit 8603e1b300 upstream.

cancel[_delayed]_work_sync() are implemented using
__cancel_work_timer() which grabs the PENDING bit using
try_to_grab_pending() and then flushes the work item with PENDING set
to prevent the on-going execution of the work item from requeueing
itself.

try_to_grab_pending() can always grab PENDING bit without blocking
except when someone else is doing the above flushing during
cancelation.  In that case, try_to_grab_pending() returns -ENOENT.  In
this case, __cancel_work_timer() currently invokes flush_work().  The
assumption is that the completion of the work item is what the other
canceling task would be waiting for too and thus waiting for the same
condition and retrying should allow forward progress without excessive
busy looping

Unfortunately, this doesn't work if preemption is disabled or the
latter task has real time priority.  Let's say task A just got woken
up from flush_work() by the completion of the target work item.  If,
before task A starts executing, task B gets scheduled and invokes
__cancel_work_timer() on the same work item, its try_to_grab_pending()
will return -ENOENT as the work item is still being canceled by task A
and flush_work() will also immediately return false as the work item
is no longer executing.  This puts task B in a busy loop possibly
preventing task A from executing and clearing the canceling state on
the work item leading to a hang.

task A			task B			worker

						executing work
__cancel_work_timer()
  try_to_grab_pending()
  set work CANCELING
  flush_work()
    block for work completion
						completion, wakes up A
			__cancel_work_timer()
			while (forever) {
			  try_to_grab_pending()
			    -ENOENT as work is being canceled
			  flush_work()
			    false as work is no longer executing
			}

This patch removes the possible hang by updating __cancel_work_timer()
to explicitly wait for clearing of CANCELING rather than invoking
flush_work() after try_to_grab_pending() fails with -ENOENT.

Link: http://lkml.kernel.org/g/20150206171156.GA8942@axis.com

v3: bit_waitqueue() can't be used for work items defined in vmalloc
    area.  Switched to custom wake function which matches the target
    work item and exclusive wait and wakeup.

v2: v1 used wake_up() on bit_waitqueue() which leads to NULL deref if
    the target bit waitqueue has wait_bit_queue's on it.  Use
    DEFINE_WAIT_BIT() and __wake_up_bit() instead.  Reported by Tomeu
    Vizoso.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Rabin Vincent <rabin.vincent@axis.com>
Cc: Tomeu Vizoso <tomeu.vizoso@gmail.com>
Tested-by: Jesper Nilsson <jesper.nilsson@axis.com>
Tested-by: Rabin Vincent <rabin.vincent@axis.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-03-26 15:06:54 +01:00
Nicholas Bellinger
bdbd61e174 target: Fix PR_APTPL_BUF_LEN buffer size limitation
commit f161d4b44d upstream.

This patch addresses the original PR_APTPL_BUF_LEN = 8k limitiation
for write-out of PR APTPL metadata that Martin has recently been
running into.

It changes core_scsi3_update_and_write_aptpl() to use vzalloc'ed
memory instead of kzalloc, and increases the default hardcoded
length to 256k.

It also adds logic in core_scsi3_update_and_write_aptpl() to double
the original length upon core_scsi3_update_aptpl_buf() failure, and
retries until the vzalloc'ed buffer is large enough to accommodate
the outgoing APTPL metadata.

Reported-by: Martin Svec <martin.svec@zoner.cz>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-03-18 13:31:24 +01:00
Vlastimil Babka
a4f3f96f85 mm: when stealing freepages, also take pages created by splitting buddy page
commit 99592d598e upstream.

When studying page stealing, I noticed some weird looking decisions in
try_to_steal_freepages().  The first I assume is a bug (Patch 1), the
following two patches were driven by evaluation.

Testing was done with stress-highalloc of mmtests, using the
mm_page_alloc_extfrag tracepoint and postprocessing to get counts of how
often page stealing occurs for individual migratetypes, and what
migratetypes are used for fallbacks.  Arguably, the worst case of page
stealing is when UNMOVABLE allocation steals from MOVABLE pageblock.
RECLAIMABLE allocation stealing from MOVABLE allocation is also not ideal,
so the goal is to minimize these two cases.

The evaluation of v2 wasn't always clear win and Joonsoo questioned the
results.  Here I used different baseline which includes RFC compaction
improvements from [1].  I found that the compaction improvements reduce
variability of stress-highalloc, so there's less noise in the data.

First, let's look at stress-highalloc configured to do sync compaction,
and how these patches reduce page stealing events during the test.  First
column is after fresh reboot, other two are reiterations of test without
reboot.  That was all accumulater over 5 re-iterations (so the benchmark
was run 5x3 times with 5 fresh restarts).

Baseline:

                                                   3.19-rc4        3.19-rc4        3.19-rc4
                                                  5-nothp-1       5-nothp-2       5-nothp-3
Page alloc extfrag event                               10264225     8702233    10244125
Extfrag fragmenting                                    10263271     8701552    10243473
Extfrag fragmenting for unmovable                         13595       17616       15960
Extfrag fragmenting unmovable placed with movable          7989       12193        8447
Extfrag fragmenting for reclaimable                         658        1840        1817
Extfrag fragmenting reclaimable placed with movable         558        1677        1679
Extfrag fragmenting for movable                        10249018     8682096    10225696

With Patch 1:
                                                   3.19-rc4        3.19-rc4        3.19-rc4
                                                  6-nothp-1       6-nothp-2       6-nothp-3
Page alloc extfrag event                               11834954     9877523     9774860
Extfrag fragmenting                                    11833993     9876880     9774245
Extfrag fragmenting for unmovable                          7342       16129       11712
Extfrag fragmenting unmovable placed with movable          4191       10547        6270
Extfrag fragmenting for reclaimable                         373        1130         923
Extfrag fragmenting reclaimable placed with movable         302         906         738
Extfrag fragmenting for movable                        11826278     9859621     9761610

With Patch 2:
                                                   3.19-rc4        3.19-rc4        3.19-rc4
                                                  7-nothp-1       7-nothp-2       7-nothp-3
Page alloc extfrag event                                4725990     3668793     3807436
Extfrag fragmenting                                     4725104     3668252     3806898
Extfrag fragmenting for unmovable                          6678        7974        7281
Extfrag fragmenting unmovable placed with movable          2051        3829        4017
Extfrag fragmenting for reclaimable                         429        1208        1278
Extfrag fragmenting reclaimable placed with movable         369         976        1034
Extfrag fragmenting for movable                         4717997     3659070     3798339

With Patch 3:
                                                   3.19-rc4        3.19-rc4        3.19-rc4
                                                  8-nothp-1       8-nothp-2       8-nothp-3
Page alloc extfrag event                                5016183     4700142     3850633
Extfrag fragmenting                                     5015325     4699613     3850072
Extfrag fragmenting for unmovable                          1312        3154        3088
Extfrag fragmenting unmovable placed with movable          1115        2777        2714
Extfrag fragmenting for reclaimable                         437        1193        1097
Extfrag fragmenting reclaimable placed with movable         330         969         879
Extfrag fragmenting for movable                         5013576     4695266     3845887

In v2 we've seen apparent regression with Patch 1 for unmovable events,
this is now gone, suggesting it was indeed noise.  Here, each patch
improves the situation for unmovable events.  Reclaimable is improved by
patch 1 and then either the same modulo noise, or perhaps sligtly worse -
a small price for unmovable improvements, IMHO.  The number of movable
allocations falling back to other migratetypes is most noisy, but it's
reduced to half at Patch 2 nevertheless.  These are least critical as
compaction can move them around.

If we look at success rates, the patches don't affect them, that didn't change.

Baseline:
                             3.19-rc4              3.19-rc4              3.19-rc4
                            5-nothp-1             5-nothp-2             5-nothp-3
Success 1 Min         49.00 (  0.00%)       42.00 ( 14.29%)       41.00 ( 16.33%)
Success 1 Mean        51.00 (  0.00%)       45.00 ( 11.76%)       42.60 ( 16.47%)
Success 1 Max         55.00 (  0.00%)       51.00 (  7.27%)       46.00 ( 16.36%)
Success 2 Min         53.00 (  0.00%)       47.00 ( 11.32%)       44.00 ( 16.98%)
Success 2 Mean        59.60 (  0.00%)       50.80 ( 14.77%)       48.20 ( 19.13%)
Success 2 Max         64.00 (  0.00%)       56.00 ( 12.50%)       52.00 ( 18.75%)
Success 3 Min         84.00 (  0.00%)       82.00 (  2.38%)       78.00 (  7.14%)
Success 3 Mean        85.60 (  0.00%)       82.80 (  3.27%)       79.40 (  7.24%)
Success 3 Max         86.00 (  0.00%)       83.00 (  3.49%)       80.00 (  6.98%)

Patch 1:
                             3.19-rc4              3.19-rc4              3.19-rc4
                            6-nothp-1             6-nothp-2             6-nothp-3
Success 1 Min         49.00 (  0.00%)       44.00 ( 10.20%)       44.00 ( 10.20%)
Success 1 Mean        51.80 (  0.00%)       46.00 ( 11.20%)       45.80 ( 11.58%)
Success 1 Max         54.00 (  0.00%)       49.00 (  9.26%)       49.00 (  9.26%)
Success 2 Min         58.00 (  0.00%)       49.00 ( 15.52%)       48.00 ( 17.24%)
Success 2 Mean        60.40 (  0.00%)       51.80 ( 14.24%)       50.80 ( 15.89%)
Success 2 Max         63.00 (  0.00%)       54.00 ( 14.29%)       55.00 ( 12.70%)
Success 3 Min         84.00 (  0.00%)       81.00 (  3.57%)       79.00 (  5.95%)
Success 3 Mean        85.00 (  0.00%)       81.60 (  4.00%)       79.80 (  6.12%)
Success 3 Max         86.00 (  0.00%)       82.00 (  4.65%)       82.00 (  4.65%)

Patch 2:

                             3.19-rc4              3.19-rc4              3.19-rc4
                            7-nothp-1             7-nothp-2             7-nothp-3
Success 1 Min         50.00 (  0.00%)       44.00 ( 12.00%)       39.00 ( 22.00%)
Success 1 Mean        52.80 (  0.00%)       45.60 ( 13.64%)       42.40 ( 19.70%)
Success 1 Max         55.00 (  0.00%)       46.00 ( 16.36%)       47.00 ( 14.55%)
Success 2 Min         52.00 (  0.00%)       48.00 (  7.69%)       45.00 ( 13.46%)
Success 2 Mean        53.40 (  0.00%)       49.80 (  6.74%)       48.80 (  8.61%)
Success 2 Max         57.00 (  0.00%)       52.00 (  8.77%)       52.00 (  8.77%)
Success 3 Min         84.00 (  0.00%)       81.00 (  3.57%)       79.00 (  5.95%)
Success 3 Mean        85.00 (  0.00%)       82.40 (  3.06%)       79.60 (  6.35%)
Success 3 Max         86.00 (  0.00%)       83.00 (  3.49%)       80.00 (  6.98%)

Patch 3:
                             3.19-rc4              3.19-rc4              3.19-rc4
                            8-nothp-1             8-nothp-2             8-nothp-3
Success 1 Min         46.00 (  0.00%)       44.00 (  4.35%)       42.00 (  8.70%)
Success 1 Mean        50.20 (  0.00%)       45.60 (  9.16%)       44.00 ( 12.35%)
Success 1 Max         52.00 (  0.00%)       47.00 (  9.62%)       47.00 (  9.62%)
Success 2 Min         53.00 (  0.00%)       49.00 (  7.55%)       48.00 (  9.43%)
Success 2 Mean        55.80 (  0.00%)       50.60 (  9.32%)       49.00 ( 12.19%)
Success 2 Max         59.00 (  0.00%)       52.00 ( 11.86%)       51.00 ( 13.56%)
Success 3 Min         84.00 (  0.00%)       80.00 (  4.76%)       79.00 (  5.95%)
Success 3 Mean        85.40 (  0.00%)       81.60 (  4.45%)       80.40 (  5.85%)
Success 3 Max         87.00 (  0.00%)       83.00 (  4.60%)       82.00 (  5.75%)

While there's no improvement here, I consider reduced fragmentation events
to be worth on its own.  Patch 2 also seems to reduce scanning for free
pages, and migrations in compaction, suggesting it has somewhat less work
to do:

Patch 1:

Compaction stalls                 4153        3959        3978
Compaction success                1523        1441        1446
Compaction failures               2630        2517        2531
Page migrate success           4600827     4943120     5104348
Page migrate failure             19763       16656       17806
Compaction pages isolated      9597640    10305617    10653541
Compaction migrate scanned    77828948    86533283    87137064
Compaction free scanned      517758295   521312840   521462251
Compaction cost                   5503        5932        6110

Patch 2:

Compaction stalls                 3800        3450        3518
Compaction success                1421        1316        1317
Compaction failures               2379        2134        2201
Page migrate success           4160421     4502708     4752148
Page migrate failure             19705       14340       14911
Compaction pages isolated      8731983     9382374     9910043
Compaction migrate scanned    98362797    96349194    98609686
Compaction free scanned      496512560   469502017   480442545
Compaction cost                   5173        5526        5811

As with v2, /proc/pagetypeinfo appears unaffected with respect to numbers
of unmovable and reclaimable pageblocks.

Configuring the benchmark to allocate like THP page fault (i.e.  no sync
compaction) gives much noisier results for iterations 2 and 3 after
reboot.  This is not so surprising given how [1] offers lower improvements
in this scenario due to less restarts after deferred compaction which
would change compaction pivot.

Baseline:
                                                   3.19-rc4        3.19-rc4        3.19-rc4
                                                    5-thp-1         5-thp-2         5-thp-3
Page alloc extfrag event                                8148965     6227815     6646741
Extfrag fragmenting                                     8147872     6227130     6646117
Extfrag fragmenting for unmovable                         10324       12942       15975
Extfrag fragmenting unmovable placed with movable          5972        8495       10907
Extfrag fragmenting for reclaimable                         601        1707        2210
Extfrag fragmenting reclaimable placed with movable         520        1570        2000
Extfrag fragmenting for movable                         8136947     6212481     6627932

Patch 1:
                                                   3.19-rc4        3.19-rc4        3.19-rc4
                                                    6-thp-1         6-thp-2         6-thp-3
Page alloc extfrag event                                8345457     7574471     7020419
Extfrag fragmenting                                     8343546     7573777     7019718
Extfrag fragmenting for unmovable                         10256       18535       30716
Extfrag fragmenting unmovable placed with movable          6893       11726       22181
Extfrag fragmenting for reclaimable                         465        1208        1023
Extfrag fragmenting reclaimable placed with movable         353         996         843
Extfrag fragmenting for movable                         8332825     7554034     6987979

Patch 2:
                                                   3.19-rc4        3.19-rc4        3.19-rc4
                                                    7-thp-1         7-thp-2         7-thp-3
Page alloc extfrag event                                3512847     3020756     2891625
Extfrag fragmenting                                     3511940     3020185     2891059
Extfrag fragmenting for unmovable                          9017        6892        6191
Extfrag fragmenting unmovable placed with movable          1524        3053        2435
Extfrag fragmenting for reclaimable                         445        1081        1160
Extfrag fragmenting reclaimable placed with movable         375         918         986
Extfrag fragmenting for movable                         3502478     3012212     2883708

Patch 3:
                                                   3.19-rc4        3.19-rc4        3.19-rc4
                                                    8-thp-1         8-thp-2         8-thp-3
Page alloc extfrag event                                3181699     3082881     2674164
Extfrag fragmenting                                     3180812     3082303     2673611
Extfrag fragmenting for unmovable                          1201        4031        4040
Extfrag fragmenting unmovable placed with movable           974        3611        3645
Extfrag fragmenting for reclaimable                         478        1165        1294
Extfrag fragmenting reclaimable placed with movable         387         985        1030
Extfrag fragmenting for movable                         3179133     3077107     2668277

The improvements for first iteration are clear, the rest is much noisier
and can appear like regression for Patch 1.  Anyway, patch 2 rectifies it.

Allocation success rates are again unaffected so there's no point in
making this e-mail any longer.

[1] http://marc.info/?l=linux-mm&m=142166196321125&w=2

This patch (of 3):

When __rmqueue_fallback() is called to allocate a page of order X, it will
find a page of order Y >= X of a fallback migratetype, which is different
from the desired migratetype.  With the help of try_to_steal_freepages(),
it may change the migratetype (to the desired one) also of:

1) all currently free pages in the pageblock containing the fallback page
2) the fallback pageblock itself
3) buddy pages created by splitting the fallback page (when Y > X)

These decisions take the order Y into account, as well as the desired
migratetype, with the goal of preventing multiple fallback allocations
that could e.g.  distribute UNMOVABLE allocations among multiple
pageblocks.

Originally, decision for 1) has implied the decision for 3).  Commit
47118af076 ("mm: mmzone: MIGRATE_CMA migration type added") changed that
(probably unintentionally) so that the buddy pages in case 3) are always
changed to the desired migratetype, except for CMA pageblocks.

Commit fef903efcf ("mm/page_allo.c: restructure free-page stealing code
and fix a bug") did some refactoring and added a comment that the case of
3) is intended.  Commit 0cbef29a78 ("mm: __rmqueue_fallback() should
respect pageblock type") removed the comment and tried to restore the
original behavior where 1) implies 3), but due to the previous
refactoring, the result is instead that only 2) implies 3) - and the
conditions for 2) are less frequently met than conditions for 1).  This
may increase fragmentation in situations where the code decides to steal
all free pages from the pageblock (case 1)), but then gives back the buddy
pages produced by splitting.

This patch restores the original intended logic where 1) implies 3).
During testing with stress-highalloc from mmtests, this has shown to
decrease the number of events where UNMOVABLE and RECLAIMABLE allocations
steal from MOVABLE pageblocks, which can lead to permanent fragmentation.
In some cases it has increased the number of events when MOVABLE
allocations steal from UNMOVABLE or RECLAIMABLE pageblocks, but these are
fixable by sync compaction and thus less harmful.

Note that evaluation has shown that the behavior introduced by
47118af076 for buddy pages in case 3) is actually even better than the
original logic, so the following patch will introduce it properly once
again.  For stable backports of this patch it makes thus sense to only fix
versions containing 0cbef29a78.

[iamjoonsoo.kim@lge.com: tracepoint fix]
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-03-18 13:31:23 +01:00
Sebastian Andrzej Siewior
df24955beb usb: core: buffer: smallest buffer should start at ARCH_DMA_MINALIGN
commit 5efd2ea8c9 upstream.

the following error pops up during "testusb -a -t 10"
| musb-hdrc musb-hdrc.1.auto: dma_pool_free buffer-128,	f134e000/be842000 (bad dma)
hcd_buffer_create() creates a few buffers, the smallest has 32 bytes of
size. ARCH_KMALLOC_MINALIGN is set to 64 bytes. This combo results in
hcd_buffer_alloc() returning memory which is 32 bytes aligned and it
might by identified by buffer_offset() as another buffer. This means the
buffer which is on a 32 byte boundary will not get freed, instead it
tries to free another buffer with the error message.

This patch fixes the issue by creating the smallest DMA buffer with the
size of ARCH_KMALLOC_MINALIGN (or 32 in case ARCH_KMALLOC_MINALIGN is
smaller). This might be 32, 64 or even 128 bytes. The next three pools
will have the size 128, 512 and 2048.
In case the smallest pool is 128 bytes then we have only three pools
instead of four (and zero the first entry in the array).
The last pool size is always 2048 bytes which is the assumed PAGE_SIZE /
2 of 4096. I doubt it makes sense to continue using PAGE_SIZE / 2 where
we would end up with 8KiB buffer in case we have 16KiB pages.
Instead I think it makes sense to have a common size(s) and extend them
if there is need to.
There is a BUILD_BUG_ON() now in case someone has a minalign of more than
128 bytes.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Acked-by: Alan Stern <stern@rowland.harvard.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-03-06 14:43:30 -08:00
Jan Kara
d4b931034c fsnotify: fix handling of renames in audit
commit 6ee8e25fc3 upstream.

Commit e9fd702a58 ("audit: convert audit watches to use fsnotify
instead of inotify") broke handling of renames in audit.  Audit code
wants to update inode number of an inode corresponding to watched name
in a directory.  When something gets renamed into a directory to a
watched name, inotify previously passed moved inode to audit code
however new fsnotify code passes directory inode where the change
happened.  That confuses audit and it starts watching parent directory
instead of a file in a directory.

This can be observed for example by doing:

  cd /tmp
  touch foo bar
  auditctl -w /tmp/foo
  touch foo
  mv bar foo
  touch foo

In audit log we see events like:

  type=CONFIG_CHANGE msg=audit(1423563584.155:90): auid=1000 ses=2 op="updated rules" path="/tmp/foo" key=(null) list=4 res=1
  ...
  type=PATH msg=audit(1423563584.155:91): item=2 name="bar" inode=1046884 dev=08:0 2 mode=0100644 ouid=0 ogid=0 rdev=00:00 nametype=DELETE
  type=PATH msg=audit(1423563584.155:91): item=3 name="foo" inode=1046842 dev=08:0 2 mode=0100644 ouid=0 ogid=0 rdev=00:00 nametype=DELETE
  type=PATH msg=audit(1423563584.155:91): item=4 name="foo" inode=1046884 dev=08:0 2 mode=0100644 ouid=0 ogid=0 rdev=00:00 nametype=CREATE
  ...

and that's it - we see event for the first touch after creating the
audit rule, we see events for rename but we don't see any event for the
last touch.  However we start seeing events for unrelated stuff
happening in /tmp.

Fix the problem by passing moved inode as data in the FS_MOVED_FROM and
FS_MOVED_TO events instead of the directory where the change happens.
This doesn't introduce any new problems because noone besides
audit_watch.c cares about the passed value:

  fs/notify/fanotify/fanotify.c cares only about FSNOTIFY_EVENT_PATH events.
  fs/notify/dnotify/dnotify.c doesn't care about passed 'data' value at all.
  fs/notify/inotify/inotify_fsnotify.c uses 'data' only for FSNOTIFY_EVENT_PATH.
  kernel/audit_tree.c doesn't care about passed 'data' at all.
  kernel/audit_watch.c expects moved inode as 'data'.

Fixes: e9fd702a58 ("audit: convert audit watches to use fsnotify instead of inotify")
Signed-off-by: Jan Kara <jack@suse.cz>
Cc: Paul Moore <paul@paul-moore.com>
Cc: Eric Paris <eparis@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-03-06 14:43:23 -08:00
Eric Dumazet
94caec8ce7 ipv4: tcp: get rid of ugly unicast_sock
[ Upstream commit bdbbb8527b ]

In commit be9f4a44e7 ("ipv4: tcp: remove per net tcp_sock")
I tried to address contention on a socket lock, but the solution
I chose was horrible :

commit 3a7c384ffd ("ipv4: tcp: unicast_sock should not land outside
of TCP stack") addressed a selinux regression.

commit 0980e56e50 ("ipv4: tcp: set unicast_sock uc_ttl to -1")
took care of another regression.

commit b5ec8eeac4 ("ipv4: fix ip_send_skb()") fixed another regression.

commit 811230cd85 ("tcp: ipv4: initialize unicast_sock sk_pacing_rate")
was another shot in the dark.

Really, just use a proper socket per cpu, and remove the skb_orphan()
call, to re-enable flow control.

This solves a serious problem with FQ packet scheduler when used in
hostile environments, as we do not want to allocate a flow structure
for every RST packet sent in response to a spoofed packet.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-02-26 17:50:12 -08:00
Hannes Frederic Sowa
ee6db0ad53 ipv4: try to cache dst_entries which would cause a redirect
[ Upstream commit df4d92549f ]

Not caching dst_entries which cause redirects could be exploited by hosts
on the same subnet, causing a severe DoS attack. This effect aggravated
since commit f886497212 ("ipv4: fix dst race in sk_dst_get()").

Lookups causing redirects will be allocated with DST_NOCACHE set which
will force dst_release to free them via RCU.  Unfortunately waiting for
RCU grace period just takes too long, we can end up with >1M dst_entries
waiting to be released and the system will run OOM. rcuos threads cannot
catch up under high softirq load.

Attaching the flag to emit a redirect later on to the specific skb allows
us to cache those dst_entries thus reducing the pressure on allocation
and deallocation.

This issue was discovered by Marcelo Leitner.

Cc: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Marcelo Leitner <mleitner@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-02-26 17:50:10 -08:00
Takashi Iwai
2451dcefc2 ALSA: ak411x: Fix stall in work callback
commit 4161b4505f upstream.

When ak4114 work calls its callback and the callback invokes
ak4114_reinit(), it stalls due to flush_delayed_work().  For avoiding
this, control the reentrance by introducing a refcount.  Also
flush_delayed_work() is replaced with cancel_delayed_work_sync().

The exactly same bug is present in ak4113.c and fixed as well.

Reported-by: Pavel Hofman <pavel.hofman@ivitera.com>
Acked-by: Jaroslav Kysela <perex@perex.cz>
Tested-by: Pavel Hofman <pavel.hofman@ivitera.com>
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-02-11 14:54:49 +08:00
Michal Hocko
525f2d0136 mm: get rid of radix tree gfp mask for pagecache_get_page
commit 45f87de57f upstream.

Commit 2457aec637 ("mm: non-atomically mark page accessed during page
cache allocation where possible") has added a separate parameter for
specifying gfp mask for radix tree allocations.

Not only this is less than optimal from the API point of view because it
is error prone, it is also buggy currently because
grab_cache_page_write_begin is using GFP_KERNEL for radix tree and if
fgp_flags doesn't contain FGP_NOFS (mostly controlled by fs by
AOP_FLAG_NOFS flag) but the mapping_gfp_mask has __GFP_FS cleared then
the radix tree allocation wouldn't obey the restriction and might
recurse into filesystem and cause deadlocks.  This is the case for most
filesystems unfortunately because only ext4 and gfs2 are using
AOP_FLAG_NOFS.

Let's simply remove radix_gfp_mask parameter because the allocation
context is same for both page cache and for the radix tree.  Just make
sure that the radix tree gets only the sane subset of the mask (e.g.  do
not pass __GFP_WRITE).

Long term it is more preferable to convert remaining users of
AOP_FLAG_NOFS to use mapping_gfp_mask instead and simplify this
interface even further.

Reported-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-01-29 17:40:53 -08:00
Mel Gorman
a0d134d3e4 mm: page_alloc: reduce cost of the fair zone allocation policy
commit 4ffeaf3560 upstream.

The fair zone allocation policy round-robins allocations between zones
within a node to avoid age inversion problems during reclaim.  If the
first allocation fails, the batch counts are reset and a second attempt
made before entering the slow path.

One assumption made with this scheme is that batches expire at roughly
the same time and the resets each time are justified.  This assumption
does not hold when zones reach their low watermark as the batches will
be consumed at uneven rates.  Allocation failure due to watermark
depletion result in additional zonelist scans for the reset and another
watermark check before hitting the slowpath.

On UMA, the benefit is negligible -- around 0.25%.  On 4-socket NUMA
machine it's variable due to the variability of measuring overhead with
the vmstat changes.  The system CPU overhead comparison looks like

          3.16.0-rc3  3.16.0-rc3  3.16.0-rc3
             vanilla   vmstat-v5 lowercost-v5
User          746.94      774.56      802.00
System      65336.22    32847.27    40852.33
Elapsed     27553.52    27415.04    27368.46

However it is worth noting that the overall benchmark still completed
faster and intuitively it makes sense to take as few passes as possible
through the zonelists.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-01-29 17:40:53 -08:00
Mel Gorman
e341f2a89a mm: move zone->pages_scanned into a vmstat counter
commit 0d5d823ab4 upstream.

zone->pages_scanned is a write-intensive cache line during page reclaim
and it's also updated during page free.  Move the counter into vmstat to
take advantage of the per-cpu updates and do not update it in the free
paths unless necessary.

On a small UMA machine running tiobench the difference is marginal.  On
a 4-node machine the overhead is more noticable.  Note that automatic
NUMA balancing was disabled for this test as otherwise the system CPU
overhead is unpredictable.

          3.16.0-rc3  3.16.0-rc3  3.16.0-rc3
             vanillarearrange-v5   vmstat-v5
User          746.94      759.78      774.56
System      65336.22    58350.98    32847.27
Elapsed     27553.52    27282.02    27415.04

Note that the overhead reduction will vary depending on where exactly
pages are allocated and freed.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-01-29 17:40:53 -08:00
Mel Gorman
36806371a2 mm: rearrange zone fields into read-only, page alloc, statistics and page reclaim lines
commit 3484b2de94 upstream.

The arrangement of struct zone has changed over time and now it has
reached the point where there is some inappropriate sharing going on.
On x86-64 for example

o The zone->node field is shared with the zone lock and zone->node is
  accessed frequently from the page allocator due to the fair zone
  allocation policy.

o span_seqlock is almost never used by shares a line with free_area

o Some zone statistics share a cache line with the LRU lock so
  reclaim-intensive and allocator-intensive workloads can bounce the cache
  line on a stat update

This patch rearranges struct zone to put read-only and read-mostly
fields together and then splits the page allocator intensive fields, the
zone statistics and the page reclaim intensive fields into their own
cache lines.  Note that the type of lowmem_reserve changes due to the
watermark calculations being signed and avoiding a signed/unsigned
conversion there.

On the test configuration I used the overall size of struct zone shrunk
by one cache line.  On smaller machines, this is not likely to be
noticable.  However, on a 4-node NUMA machine running tiobench the
system CPU overhead is reduced by this patch.

          3.16.0-rc3  3.16.0-rc3
             vanillarearrange-v5r9
User          746.94      759.78
System      65336.22    58350.98
Elapsed     27553.52    27282.02

Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-01-29 17:40:52 -08:00
Mel Gorman
84532d7739 mm: pagemap: avoid unnecessary overhead when tracepoints are deactivated
commit 24b7e5819a upstream.

This was formerly the series "Improve sequential read throughput" which
noted some major differences in performance of tiobench since 3.0.
While there are a number of factors, two that dominated were the
introduction of the fair zone allocation policy and changes to CFQ.

The behaviour of fair zone allocation policy makes more sense than
tiobench as a benchmark and CFQ defaults were not changed due to
insufficient benchmarking.

This series is what's left.  It's one functional fix to the fair zone
allocation policy when used on NUMA machines and a reduction of overhead
in general.  tiobench was used for the comparison despite its flaws as
an IO benchmark as in this case we are primarily interested in the
overhead of page allocator and page reclaim activity.

On UMA, it makes little difference to overhead

          3.16.0-rc3   3.16.0-rc3
             vanilla lowercost-v5
User          383.61      386.77
System        403.83      401.74
Elapsed      5411.50     5413.11

On a 4-socket NUMA machine it's a bit more noticable

          3.16.0-rc3   3.16.0-rc3
             vanilla lowercost-v5
User          746.94      802.00
System      65336.22    40852.33
Elapsed     27553.52    27368.46

This patch (of 6):

The LRU insertion and activate tracepoints take PFN as a parameter
forcing the overhead to the caller.  Move the overhead to the tracepoint
fast-assign method to ensure the cost is only incurred when the
tracepoint is active.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-01-29 17:40:52 -08:00
Jerome Marchand
63aefc47b6 mm: make copy_pte_range static again
commit 21bda264f4 upstream.

Commit 71e3aac072 ("thp: transparent hugepage core") adds
copy_pte_range prototype to huge_mm.h.  I'm not sure why (or if) this
function have been used outside of memory.c, but it currently isn't.
This patch makes copy_pte_range() static again.

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-01-29 17:40:52 -08:00
Mel Gorman
35dbe179fe mm: non-atomically mark page accessed during page cache allocation where possible
commit 2457aec637 upstream.

aops->write_begin may allocate a new page and make it visible only to have
mark_page_accessed called almost immediately after.  Once the page is
visible the atomic operations are necessary which is noticable overhead
when writing to an in-memory filesystem like tmpfs but should also be
noticable with fast storage.  The objective of the patch is to initialse
the accessed information with non-atomic operations before the page is
visible.

The bulk of filesystems directly or indirectly use
grab_cache_page_write_begin or find_or_create_page for the initial
allocation of a page cache page.  This patch adds an init_page_accessed()
helper which behaves like the first call to mark_page_accessed() but may
called before the page is visible and can be done non-atomically.

The primary APIs of concern in this care are the following and are used
by most filesystems.

	find_get_page
	find_lock_page
	find_or_create_page
	grab_cache_page_nowait
	grab_cache_page_write_begin

All of them are very similar in detail to the patch creates a core helper
pagecache_get_page() which takes a flags parameter that affects its
behavior such as whether the page should be marked accessed or not.  Then
old API is preserved but is basically a thin wrapper around this core
function.

Each of the filesystems are then updated to avoid calling
mark_page_accessed when it is known that the VM interfaces have already
done the job.  There is a slight snag in that the timing of the
mark_page_accessed() has now changed so in rare cases it's possible a page
gets to the end of the LRU as PageReferenced where as previously it might
have been repromoted.  This is expected to be rare but it's worth the
filesystem people thinking about it in case they see a problem with the
timing change.  It is also the case that some filesystems may be marking
pages accessed that previously did not but it makes sense that filesystems
have consistent behaviour in this regard.

The test case used to evaulate this is a simple dd of a large file done
multiple times with the file deleted on each iterations.  The size of the
file is 1/10th physical memory to avoid dirty page balancing.  In the
async case it will be possible that the workload completes without even
hitting the disk and will have variable results but highlight the impact
of mark_page_accessed for async IO.  The sync results are expected to be
more stable.  The exception is tmpfs where the normal case is for the "IO"
to not hit the disk.

The test machine was single socket and UMA to avoid any scheduling or NUMA
artifacts.  Throughput and wall times are presented for sync IO, only wall
times are shown for async as the granularity reported by dd and the
variability is unsuitable for comparison.  As async results were variable
do to writback timings, I'm only reporting the maximum figures.  The sync
results were stable enough to make the mean and stddev uninteresting.

The performance results are reported based on a run with no profiling.
Profile data is based on a separate run with oprofile running.

async dd
                                    3.15.0-rc3            3.15.0-rc3
                                       vanilla           accessed-v2
ext3    Max      elapsed     13.9900 (  0.00%)     11.5900 ( 17.16%)
tmpfs	Max      elapsed      0.5100 (  0.00%)      0.4900 (  3.92%)
btrfs   Max      elapsed     12.8100 (  0.00%)     12.7800 (  0.23%)
ext4	Max      elapsed     18.6000 (  0.00%)     13.3400 ( 28.28%)
xfs	Max      elapsed     12.5600 (  0.00%)      2.0900 ( 83.36%)

The XFS figure is a bit strange as it managed to avoid a worst case by
sheer luck but the average figures looked reasonable.

        samples percentage
ext3       86107    0.9783  vmlinux-3.15.0-rc4-vanilla        mark_page_accessed
ext3       23833    0.2710  vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
ext3        5036    0.0573  vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
ext4       64566    0.8961  vmlinux-3.15.0-rc4-vanilla        mark_page_accessed
ext4        5322    0.0713  vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
ext4        2869    0.0384  vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
xfs        62126    1.7675  vmlinux-3.15.0-rc4-vanilla        mark_page_accessed
xfs         1904    0.0554  vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
xfs          103    0.0030  vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
btrfs      10655    0.1338  vmlinux-3.15.0-rc4-vanilla        mark_page_accessed
btrfs       2020    0.0273  vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
btrfs        587    0.0079  vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
tmpfs      59562    3.2628  vmlinux-3.15.0-rc4-vanilla        mark_page_accessed
tmpfs       1210    0.0696  vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
tmpfs         94    0.0054  vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed

[akpm@linux-foundation.org: don't run init_page_accessed() against an uninitialised pointer]
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Jan Kara <jack@suse.cz>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Tested-by: Prabhakar Lad <prabhakar.csengg@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-01-29 17:40:52 -08:00
Mel Gorman
1183f6ad43 mm: shmem: avoid atomic operation during shmem_getpage_gfp
commit 07a4278843 upstream.

shmem_getpage_gfp uses an atomic operation to set the SwapBacked field
before it's even added to the LRU or visible.  This is unnecessary as what
could it possible race against?  Use an unlocked variant.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Jan Kara <jack@suse.cz>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-01-29 17:40:52 -08:00
Mel Gorman
5cfde3e59e mm: page_alloc: convert hot/cold parameter and immediate callers to bool
commit b745bc85f2 upstream.

cold is a bool, make it one.  Make the likely case the "if" part of the
block instead of the else as according to the optimisation manual this is
preferred.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Jan Kara <jack@suse.cz>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-01-29 17:40:51 -08:00
Mel Gorman
722191a52a mm: page_alloc: reduce number of times page_to_pfn is called
commit dc4b0caff2 upstream.

In the free path we calculate page_to_pfn multiple times. Reduce that.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Jan Kara <jack@suse.cz>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-01-29 17:40:51 -08:00
Mel Gorman
89712aa096 mm: page_alloc: use unsigned int for order in more places
commit 7aeb09f910 upstream.

X86 prefers the use of unsigned types for iterators and there is a
tendency to mix whether a signed or unsigned type if used for page order.
This converts a number of sites in mm/page_alloc.c to use unsigned int for
order where possible.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Jan Kara <jack@suse.cz>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-01-29 17:40:51 -08:00
Mel Gorman
ab9f54b2d1 mm: page_alloc: use jump labels to avoid checking number_of_cpusets
commit 664eeddeef upstream.

If cpusets are not in use then we still check a global variable on every
page allocation.  Use jump labels to avoid the overhead.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Jan Kara <jack@suse.cz>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-01-29 17:40:51 -08:00
Mel Gorman
7a0b5ca9ba include/linux/jump_label.h: expose the reference count
commit ea5e9539ab upstream.

This patch exposes the jump_label reference count in preparation for the
next patch.  cpusets cares about both the jump_label being enabled and how
many users of the cpusets there currently are.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Jan Kara <jack@suse.cz>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-01-29 17:40:51 -08:00
Jianyu Zhan
83a1ac8aeb mm/swap.c: clean up *lru_cache_add* functions
commit 2329d3751b upstream.

In mm/swap.c, __lru_cache_add() is exported, but actually there are no
users outside this file.

This patch unexports __lru_cache_add(), and makes it static.  It also
exports lru_cache_add_file(), as it is use by cifs and fuse, which can
loaded as modules.

Signed-off-by: Jianyu Zhan <nasa4836@gmail.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Shaohua Li <shli@kernel.org>
Cc: Bob Liu <bob.liu@oracle.com>
Cc: Seth Jennings <sjenning@linux.vnet.ibm.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Rafael Aquini <aquini@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Khalid Aziz <khalid.aziz@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-01-29 17:40:51 -08:00
Kees Cook
31c8446aac crypto: prefix module autoloading with "crypto-"
commit 5d26a105b5 upstream.

This prefixes all crypto module loading with "crypto-" so we never run
the risk of exposing module auto-loading to userspace via a crypto API,
as demonstrated by Mathias Krause:

https://lkml.org/lkml/2013/3/4/70

Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-01-29 17:40:50 -08:00
Sasha Levin
5fea291edd time: settimeofday: Validate the values of tv from user
commit 6ada1fc0e1 upstream.

An unvalidated user input is multiplied by a constant, which can result in
an undefined behaviour for large values. While this is validated later,
we should avoid triggering undefined behaviour.

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
[jstultz: include trivial milisecond->microsecond correction noticed
by Andy]
Signed-off-by: John Stultz <john.stultz@linaro.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-01-29 17:40:49 -08:00
Alex Williamson
408f4508ec PCI: Add flag for devices where we can't use bus reset
commit f331a859e0 upstream.

Enable a mechanism for devices to quirk that they do not behave when
doing a PCI bus reset.  We require a modest level of spec compliant
behavior in order to do a reset, for instance the device should come
out of reset without throwing errors and PCI config space should be
accessible after reset.  This is too much to ask for some devices.

Link: http://lkml.kernel.org/r/20140923210318.498dacbd@dualc.maya.org
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-01-29 17:40:48 -08:00
Thomas Gleixner
5666a3de7a genirq: Prevent proc race against freeing of irq descriptors
commit c291ee6221 upstream.

Since the rework of the sparse interrupt code to actually free the
unused interrupt descriptors there exists a race between the /proc
interfaces to the irq subsystem and the code which frees the interrupt
descriptor.

CPU0				CPU1
				show_interrupts()
				  desc = irq_to_desc(X);
free_desc(desc)
  remove_from_radix_tree();
  kfree(desc);
				  raw_spinlock_irq(&desc->lock);

/proc/interrupts is the only interface which can actively corrupt
kernel memory via the lock access. /proc/stat can only read from freed
memory. Extremly hard to trigger, but possible.

The interfaces in /proc/irq/N/ are not affected by this because the
removal of the proc file is serialized in procfs against concurrent
readers/writers. The removal happens before the descriptor is freed.

For architectures which have CONFIG_SPARSE_IRQ=n this is a non issue
as the descriptor is never freed. It's merely cleared out with the irq
descriptor lock held. So any concurrent proc access will either see
the old correct value or the cleared out ones.

Protect the lookup and access to the irq descriptor in
show_interrupts() with the sparse_irq_lock.

Provide kstat_irqs_usr() which is protecting the lookup and access
with sparse_irq_lock and switch /proc/stat to use it.

Document the existing kstat_irqs interfaces so it's clear that the
caller needs to take care about protection. The users of these
interfaces are either not affected due to SPARSE_IRQ=n or already
protected against removal.

Fixes: 1f5a5b87f7 "genirq: Implement a sane sparse_irq allocator"
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-01-27 08:18:55 -08:00
stephen hemminger
1bd686b1ae in6: fix conflict with glibc
[ Upstream commit 6d08acd2d3 ]

Resolve conflicts between glibc definition of IPV6 socket options
and those defined in Linux headers. Looks like earlier efforts to
solve this did not cover all the definitions.

It resolves warnings during iproute2 build.
Please consider for stable as well.

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-01-27 08:18:53 -08:00
Linus Torvalds
11e4f3bfdf mm: propagate error from stack expansion even for guard page
commit fee7e49d45 upstream.

Jay Foad reports that the address sanitizer test (asan) sometimes gets
confused by a stack pointer that ends up being outside the stack vma
that is reported by /proc/maps.

This happens due to an interaction between RLIMIT_STACK and the guard
page: when we do the guard page check, we ignore the potential error
from the stack expansion, which effectively results in a missing guard
page, since the expected stack expansion won't have been done.

And since /proc/maps explicitly ignores the guard page (commit
d7824370e2: "mm: fix up some user-visible effects of the stack guard
page"), the stack pointer ends up being outside the reported stack area.

This is the minimal patch: it just propagates the error.  It also
effectively makes the guard page part of the stack limit, which in turn
measn that the actual real stack is one page less than the stack limit.

Let's see if anybody notices.  We could teach acct_stack_growth() to
allow an extra page for a grow-up/grow-down stack in the rlimit test,
but I don't want to add more complexity if it isn't needed.

Reported-and-tested-by: Jay Foad <jay.foad@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-01-16 06:59:35 -08:00
Steven Rostedt (Red Hat)
37b2a5a7c8 tracing/sched: Check preempt_count() for current when reading task->state
commit aee4e5f3d3 upstream.

When recording the state of a task for the sched_switch tracepoint a check of
task_preempt_count() is performed to see if PREEMPT_ACTIVE is set. This is
because, technically, a task being preempted is really in the TASK_RUNNING
state, and that is what should be recorded when tracing a sched_switch,
even if the task put itself into another state (it hasn't scheduled out
in that state yet).

But with the change to use per_cpu preempt counts, the
task_thread_info(p)->preempt_count is no longer used, and instead
task_preempt_count(p) is used.

The problem is that this does not use the current preempt count but a stale
one from a previous sched_switch. The task_preempt_count(p) uses
saved_preempt_count and not preempt_count(). But for tracing sched_switch,
if p is current, we really want preempt_count().

I hit this bug when I was tracing sleep and the call from do_nanosleep()
scheduled out in the "RUNNING" state.

           sleep-4290  [000] 537272.259992: sched_switch:         sleep:4290 [120] R ==> swapper/0:0 [120]
           sleep-4290  [000] 537272.260015: kernel_stack:         <stack trace>
=> __schedule (ffffffff8150864a)
=> schedule (ffffffff815089f8)
=> do_nanosleep (ffffffff8150b76c)
=> hrtimer_nanosleep (ffffffff8108d66b)
=> SyS_nanosleep (ffffffff8108d750)
=> return_to_handler (ffffffff8150e8e5)
=> tracesys_phase2 (ffffffff8150c844)

After a bit of hair pulling, I found that the state was really
TASK_INTERRUPTIBLE, but the saved_preempt_count had an old PREEMPT_ACTIVE
set and caused the sched_switch tracepoint to show it as RUNNING.

Link: http://lkml.kernel.org/r/20141210174428.3cb7542a@gandalf.local.home

Acked-by: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Fixes: 0102874755 "sched: Create more preempt_count accessors"
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-01-16 06:59:32 -08:00
Tony Lindgren
2e5df19fc0 pstore-ram: Allow optional mapping with pgprot_noncached
commit 027bc8b082 upstream.

On some ARMs the memory can be mapped pgprot_noncached() and still
be working for atomic operations. As pointed out by Colin Cross
<ccross@android.com>, in some cases you do want to use
pgprot_noncached() if the SoC supports it to see a debug printk
just before a write hanging the system.

On ARMs, the atomic operations on strongly ordered memory are
implementation defined. So let's provide an optional kernel parameter
for configuring pgprot_noncached(), and use pgprot_writecombine() by
default.

Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Rob Herring <robherring2@gmail.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Anton Vorontsov <anton@enomsg.org>
Cc: Colin Cross <ccross@android.com>
Cc: Olof Johansson <olof@lixom.net>
Cc: Russell King <linux@arm.linux.org.uk>
Acked-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Tony Lindgren <tony@atomide.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-01-16 06:59:30 -08:00
Richard Guy Briggs
1cfececcbc audit: restore AUDIT_LOGINUID unset ABI
commit 041d7b98ff upstream.

A regression was caused by commit 780a7654ce:
	 audit: Make testing for a valid loginuid explicit.
(which in turn attempted to fix a regression caused by e1760bd)

When audit_krule_to_data() fills in the rules to get a listing, there was a
missing clause to convert back from AUDIT_LOGINUID_SET to AUDIT_LOGINUID.

This broke userspace by not returning the same information that was sent and
expected.

The rule:
	auditctl -a exit,never -F auid=-1
gives:
	auditctl -l
		LIST_RULES: exit,never f24=0 syscall=all
when it should give:
		LIST_RULES: exit,never auid=-1 (0xffffffff) syscall=all

Tag it so that it is reported the same way it was set.  Create a new
private flags audit_krule field (pflags) to store it that won't interact with
the public one from the API.

Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Paul Moore <pmoore@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-01-08 10:00:50 -08:00
Eric W. Biederman
cbc4266efe userns: Add a knob to disable setgroups on a per user namespace basis
commit 9cc46516dd upstream.

- Expose the knob to user space through a proc file /proc/<pid>/setgroups

  A value of "deny" means the setgroups system call is disabled in the
  current processes user namespace and can not be enabled in the
  future in this user namespace.

  A value of "allow" means the segtoups system call is enabled.

- Descendant user namespaces inherit the value of setgroups from
  their parents.

- A proc file is used (instead of a sysctl) as sysctls currently do
  not allow checking the permissions at open time.

- Writing to the proc file is restricted to before the gid_map
  for the user namespace is set.

  This ensures that disabling setgroups at a user namespace
  level will never remove the ability to call setgroups
  from a process that already has that ability.

  A process may opt in to the setgroups disable for itself by
  creating, entering and configuring a user namespace or by calling
  setns on an existing user namespace with setgroups disabled.
  Processes without privileges already can not call setgroups so this
  is a noop.  Prodcess with privilege become processes without
  privilege when entering a user namespace and as with any other path
  to dropping privilege they would not have the ability to call
  setgroups.  So this remains within the bounds of what is possible
  without a knob to disable setgroups permanently in a user namespace.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-01-08 10:00:50 -08:00
Eric W. Biederman
f077e88fc9 userns: Don't allow setgroups until a gid mapping has been setablished
commit 273d2c67c3 upstream.

setgroups is unique in not needing a valid mapping before it can be called,
in the case of setgroups(0, NULL) which drops all supplemental groups.

The design of the user namespace assumes that CAP_SETGID can not actually
be used until a gid mapping is established.  Therefore add a helper function
to see if the user namespace gid mapping has been established and call
that function in the setgroups permission check.

This is part of the fix for CVE-2014-8989, being able to drop groups
without privilege using user namespaces.

Reviewed-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-01-08 10:00:50 -08:00
Eric W. Biederman
00fcd1ceab groups: Consolidate the setgroups permission checks
commit 7ff4d90b4c upstream.

Today there are 3 instances of setgroups and due to an oversight their
permission checking has diverged.  Add a common function so that
they may all share the same permission checking code.

This corrects the current oversight in the current permission checks
and adds a helper to avoid this in the future.

A user namespace security fix will update this new helper, shortly.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-01-08 10:00:49 -08:00
Maxime COQUELIN
42be4e5c4a bitops: Fix shift overflow in GENMASK macros
commit 00b4d9a141 upstream.

On some 32 bits architectures, including x86, GENMASK(31, 0) returns 0
instead of the expected ~0UL.

This is the same on some 64 bits architectures with GENMASK_ULL(63, 0).

This is due to an overflow in the shift operand, 1 << 32 for GENMASK,
1 << 64 for GENMASK_ULL.

Reported-by: Eric Paire <eric.paire@st.com>
Suggested-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Signed-off-by: Maxime Coquelin <maxime.coquelin@st.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: linux@rasmusvillemoes.dk
Cc: gong.chen@linux.intel.com
Cc: John Sullivan <jsrhbz@kanargh.force9.co.uk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Fixes: 10ef6b0dff ("bitops: Introduce a more generic BITMASK macro")
Link: http://lkml.kernel.org/r/1415267659-10563-1-git-send-email-maxime.coquelin@st.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-12-06 15:55:39 -08:00
Cristina Ciocan
ecee7138d8 iio: Fix IIO_EVENT_CODE_EXTRACT_DIR bit mask
commit ccf54555da upstream.

The direction field is set on 7 bits, thus we need to AND it with 0111 111 mask
in order to retrieve it, that is 0x7F, not 0xCF as it is now.

Fixes: ade7ef7ba (staging:iio: Differential channel handling)
Signed-off-by: Cristina Ciocan <cristina.ciocan@intel.com>
Signed-off-by: Jonathan Cameron <jic23@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-12-06 15:55:36 -08:00
Takashi Iwai
05619290c1 ASoC: dpcm: Fix race between FE/BE updates and trigger
commit ea9d0d771f upstream.

DPCM can update the FE/BE connection states totally asynchronously
from the FE's PCM state.  Most of FE/BE state changes are protected by
mutex, so that they won't race, but there are still some actions that
are uncovered.  For example, suppose to switch a BE while a FE's
stream is running.  This would call soc_dpcm_runtime_update(), which
sets FE's runtime_update flag, then sets up and starts BEs, and clears
FE's runtime_update flag again.

When a device emits XRUN during this operation, the PCM core triggers
snd_pcm_stop(XRUN).  Since the trigger action is an atomic ops, this
isn't blocked by the mutex, thus it kicks off DPCM's trigger action.
It eventually updates and clears FE's runtime_update flag while
soc_dpcm_runtime_update() is running concurrently, and it results in
confusion.

Usually, for avoiding such a race, we take a lock.  There is a PCM
stream lock for that purpose.  However, as already mentioned, the
trigger action is atomic, and we can't take the lock for the whole
soc_dpcm_runtime_update() or other operations that include the lengthy
jobs like hw_params or prepare.

This patch provides an alternative solution.  This adds a way to defer
the conflicting trigger callback to be executed at the end of FE/BE
state changes.  For doing it, two things are introduced:

- Each runtime_update state change of FEs is protected via PCM stream
  lock.
- The FE's trigger callback checks the runtime_update flag.  If it's
  not set, the trigger action is executed there.  If set, mark the
  pending trigger action and returns immediately.
- At the exit of runtime_update state change, it checks whether the
  pending trigger is present.  If yes, it executes the trigger action
  at this point.

Reported-and-tested-by: Qiao Zhou <zhouqiao@marvell.com>
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Acked-by: Liam Girdwood <liam.r.girdwood@linux.intel.com>
Signed-off-by: Mark Brown <broonie@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-12-06 15:55:35 -08:00
Benjamin Herrenschmidt
2c120e85de PCI/MSI: Add device flag indicating that 64-bit MSIs don't work
commit f144d1496b upstream.

This can be set by quirks/drivers to be used by the architecture code
that assigns the MSI addresses.

We additionally add verification in the core MSI code that the values
assigned by the architecture do satisfy the limitation in order to fail
gracefully if they don't (ie. the arch hasn't been updated to deal with
that quirk yet).

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-12-06 15:55:34 -08:00
Vincent BENAYOUN
548da392f3 inetdevice: fixed signed integer overflow
[ Upstream commit 84bc88688e ]

There could be a signed overflow in the following code.

The expression, (32-logmask) is comprised between 0 and 31 included.
It may be equal to 31.
In such a case the left shift will produce a signed integer overflow.
According to the C99 Standard, this is an undefined behavior.
A simple fix is to replace the signed int 1 with the unsigned int 1U.

Signed-off-by: Vincent BENAYOUN <vincent.benayoun@trust-in-soft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-12-06 15:55:33 -08:00
Vlastimil Babka
41c9323cf1 mm/compaction: do not count migratepages when unnecessary
commit f8c9301fa5 upstream.

During compaction, update_nr_listpages() has been used to count remaining
non-migrated and free pages after a call to migrage_pages().  The
freepages counting has become unneccessary, and it turns out that
migratepages counting is also unnecessary in most cases.

The only situation when it's needed to count cc->migratepages is when
migrate_pages() returns with a negative error code.  Otherwise, the
non-negative return value is the number of pages that were not migrated,
which is exactly the count of remaining pages in the cc->migratepages
list.

Furthermore, any non-zero count is only interesting for the tracepoint of
mm_compaction_migratepages events, because after that all remaining
unmigrated pages are put back and their count is set to 0.

This patch therefore removes update_nr_listpages() completely, and changes
the tracepoint definition so that the manual counting is done only when
the tracepoint is enabled, and only when migrate_pages() returns a
negative error code.

Furthermore, migrate_pages() and the tracepoints won't be called when
there's nothing to migrate.  This potentially avoids some wasted cycles
and reduces the volume of uninteresting mm_compaction_migratepages events
where "nr_migrated=0 nr_failed=0".  In the stress-highalloc mmtest, this
was about 75% of the events.  The mm_compaction_isolate_migratepages event
is better for determining that nothing was isolated for migration, and
this one was just duplicating the info.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Acked-by: Michal Nazarewicz <mina86@mina86.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Rik van Riel <riel@redhat.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-21 09:23:07 -08:00
David Rientjes
102a623045 mm, compaction: embed migration mode in compact_control
commit e0b9daeb45 upstream.

We're going to want to manipulate the migration mode for compaction in the
page allocator, and currently compact_control's sync field is only a bool.

Currently, we only do MIGRATE_ASYNC or MIGRATE_SYNC_LIGHT compaction
depending on the value of this bool.  Convert the bool to enum
migrate_mode and pass the migration mode in directly.  Later, we'll want
to avoid MIGRATE_SYNC_LIGHT for thp allocations in the pagefault patch to
avoid unnecessary latency.

This also alters compaction triggered from sysfs, either for the entire
system or for a node, to force MIGRATE_SYNC.

[akpm@linux-foundation.org: fix build]
[iamjoonsoo.kim@lge.com: use MIGRATE_SYNC in alloc_contig_range()]
Signed-off-by: David Rientjes <rientjes@google.com>
Suggested-by: Mel Gorman <mgorman@suse.de>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Greg Thelen <gthelen@google.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-21 09:23:07 -08:00
David Rientjes
3793816b67 mm, compaction: add per-zone migration pfn cache for async compaction
commit 35979ef339 upstream.

Each zone has a cached migration scanner pfn for memory compaction so that
subsequent calls to memory compaction can start where the previous call
left off.

Currently, the compaction migration scanner only updates the per-zone
cached pfn when pageblocks were not skipped for async compaction.  This
creates a dependency on calling sync compaction to avoid having subsequent
calls to async compaction from scanning an enormous amount of non-MOVABLE
pageblocks each time it is called.  On large machines, this could be
potentially very expensive.

This patch adds a per-zone cached migration scanner pfn only for async
compaction.  It is updated everytime a pageblock has been scanned in its
entirety and when no pages from it were successfully isolated.  The cached
migration scanner pfn for sync compaction is updated only when called for
sync compaction.

Signed-off-by: David Rientjes <rientjes@google.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-21 09:23:07 -08:00
David Rientjes
a527e8d4f7 mm, migration: add destination page freeing callback
commit 68711a7463 upstream.

Memory migration uses a callback defined by the caller to determine how to
allocate destination pages.  When migration fails for a source page,
however, it frees the destination page back to the system.

This patch adds a memory migration callback defined by the caller to
determine how to free destination pages.  If a caller, such as memory
compaction, builds its own freelist for migration targets, this can reuse
already freed memory instead of scanning additional memory.

If the caller provides a function to handle freeing of destination pages,
it is called when page migration fails.  If the caller passes NULL then
freeing back to the system will be handled as usual.  This patch
introduces no functional change.

Signed-off-by: David Rientjes <rientjes@google.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Greg Thelen <gthelen@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-21 09:23:06 -08:00
Fabian Frederick
aa64050a24 mm/readahead.c: inline ra_submit
commit 29f175d125 upstream.

Commit f9acc8c7b3 ("readahead: sanify file_ra_state names") left
ra_submit with a single function call.

Move ra_submit to internal.h and inline it to save some stack.  Thanks
to Andrew Morton for commenting different versions.

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Suggested-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-21 09:23:06 -08:00
Sasha Levin
034c4b3e83 mm: remove read_cache_page_async()
commit 67f9fd91f9 upstream.

This patch removes read_cache_page_async() which wasn't really needed
anywhere and simplifies the code around it a bit.

read_cache_page_async() is useful when we want to read a page into the
cache without waiting for it to complete.  This happens when the
appropriate callback 'filler' doesn't complete its read operation and
releases the page lock immediately, and instead queues a different
completion routine to do that.  This never actually happened anywhere in
the code.

read_cache_page_async() had 3 different callers:

- read_cache_page() which is the sync version, it would just wait for
  the requested read to complete using wait_on_page_read().

- JFFS2 would call it from jffs2_gc_fetch_page(), but the filler
  function it supplied doesn't do any async reads, and would complete
  before the filler function returns - making it actually a sync read.

- CRAMFS would call it using the read_mapping_page_async() wrapper, with
  a similar story to JFFS2 - the filler function doesn't do anything that
  reminds async reads and would always complete before the filler function
  returns.

To sum it up, the code in mm/filemap.c never took advantage of having
read_cache_page_async().  While there are filler callbacks that do async
reads (such as the block one), we always called it with the
read_cache_page().

This patch adds a mandatory wait for read to complete when adding a new
page to the cache, and removes read_cache_page_async() and its wrappers.

Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-21 09:23:06 -08:00
Johannes Weiner
414af56f42 mm + fs: prepare for non-page entries in page cache radix trees
commit 0cd6144aad upstream.

shmem mappings already contain exceptional entries where swap slot
information is remembered.

To be able to store eviction information for regular page cache, prepare
every site dealing with the radix trees directly to handle entries other
than pages.

The common lookup functions will filter out non-page entries and return
NULL for page cache holes, just as before.  But provide a raw version of
the API which returns non-page entries as well, and switch shmem over to
use it.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Minchan Kim <minchan@kernel.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Bob Liu <bob.liu@oracle.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jan Kara <jack@suse.cz>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Luigi Semenzato <semenzato@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Metin Doslu <metin@citusdata.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Ozgun Erdogan <ozgun@citusdata.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Roman Gushchin <klamm@yandex-team.ru>
Cc: Ryan Mallon <rmallon@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-21 09:23:06 -08:00
Johannes Weiner
d141bb0e3f mm: filemap: move radix tree hole searching here
commit e7b563bb2a upstream.

The radix tree hole searching code is only used for page cache, for
example the readahead code trying to get a a picture of the area
surrounding a fault.

It sufficed to rely on the radix tree definition of holes, which is
"empty tree slot".  But this is about to change, though, as shadow page
descriptors will be stored in the page cache after the actual pages get
evicted from memory.

Move the functions over to mm/filemap.c and make them native page cache
operations, where they can later be adapted to handle the new definition
of "page cache hole".

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Minchan Kim <minchan@kernel.org>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Bob Liu <bob.liu@oracle.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jan Kara <jack@suse.cz>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Luigi Semenzato <semenzato@google.com>
Cc: Metin Doslu <metin@citusdata.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Ozgun Erdogan <ozgun@citusdata.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Roman Gushchin <klamm@yandex-team.ru>
Cc: Ryan Mallon <rmallon@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-21 09:23:06 -08:00
Johannes Weiner
d35a6232f8 lib: radix-tree: add radix_tree_delete_item()
commit 53c59f262d upstream.

Provide a function that does not just delete an entry at a given index,
but also allows passing in an expected item.  Delete only if that item
is still located at the specified index.

This is handy when lockless tree traversals want to delete entries as
well because they don't have to do an second, locked lookup to verify
the slot has not changed under them before deleting the entry.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Minchan Kim <minchan@kernel.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Bob Liu <bob.liu@oracle.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jan Kara <jack@suse.cz>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Luigi Semenzato <semenzato@google.com>
Cc: Metin Doslu <metin@citusdata.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Ozgun Erdogan <ozgun@citusdata.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Roman Gushchin <klamm@yandex-team.ru>
Cc: Ryan Mallon <rmallon@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-21 09:23:06 -08:00
Daniel Borkmann
e36b6ac9e0 net: sctp: fix skb_over_panic when receiving malformed ASCONF chunks
commit 9de7922bc7 upstream.

Commit 6f4c618ddb ("SCTP : Add paramters validity check for
ASCONF chunk") added basic verification of ASCONF chunks, however,
it is still possible to remotely crash a server by sending a
special crafted ASCONF chunk, even up to pre 2.6.12 kernels:

skb_over_panic: text:ffffffffa01ea1c3 len:31056 put:30768
 head:ffff88011bd81800 data:ffff88011bd81800 tail:0x7950
 end:0x440 dev:<NULL>
 ------------[ cut here ]------------
kernel BUG at net/core/skbuff.c:129!
[...]
Call Trace:
 <IRQ>
 [<ffffffff8144fb1c>] skb_put+0x5c/0x70
 [<ffffffffa01ea1c3>] sctp_addto_chunk+0x63/0xd0 [sctp]
 [<ffffffffa01eadaf>] sctp_process_asconf+0x1af/0x540 [sctp]
 [<ffffffff8152d025>] ? _read_unlock_bh+0x15/0x20
 [<ffffffffa01e0038>] sctp_sf_do_asconf+0x168/0x240 [sctp]
 [<ffffffffa01e3751>] sctp_do_sm+0x71/0x1210 [sctp]
 [<ffffffff8147645d>] ? fib_rules_lookup+0xad/0xf0
 [<ffffffffa01e6b22>] ? sctp_cmp_addr_exact+0x32/0x40 [sctp]
 [<ffffffffa01e8393>] sctp_assoc_bh_rcv+0xd3/0x180 [sctp]
 [<ffffffffa01ee986>] sctp_inq_push+0x56/0x80 [sctp]
 [<ffffffffa01fcc42>] sctp_rcv+0x982/0xa10 [sctp]
 [<ffffffffa01d5123>] ? ipt_local_in_hook+0x23/0x28 [iptable_filter]
 [<ffffffff8148bdc9>] ? nf_iterate+0x69/0xb0
 [<ffffffff81496d10>] ? ip_local_deliver_finish+0x0/0x2d0
 [<ffffffff8148bf86>] ? nf_hook_slow+0x76/0x120
 [<ffffffff81496d10>] ? ip_local_deliver_finish+0x0/0x2d0
 [<ffffffff81496ded>] ip_local_deliver_finish+0xdd/0x2d0
 [<ffffffff81497078>] ip_local_deliver+0x98/0xa0
 [<ffffffff8149653d>] ip_rcv_finish+0x12d/0x440
 [<ffffffff81496ac5>] ip_rcv+0x275/0x350
 [<ffffffff8145c88b>] __netif_receive_skb+0x4ab/0x750
 [<ffffffff81460588>] netif_receive_skb+0x58/0x60

This can be triggered e.g., through a simple scripted nmap
connection scan injecting the chunk after the handshake, for
example, ...

  -------------- INIT[ASCONF; ASCONF_ACK] ------------->
  <----------- INIT-ACK[ASCONF; ASCONF_ACK] ------------
  -------------------- COOKIE-ECHO -------------------->
  <-------------------- COOKIE-ACK ---------------------
  ------------------ ASCONF; UNKNOWN ------------------>

... where ASCONF chunk of length 280 contains 2 parameters ...

  1) Add IP address parameter (param length: 16)
  2) Add/del IP address parameter (param length: 255)

... followed by an UNKNOWN chunk of e.g. 4 bytes. Here, the
Address Parameter in the ASCONF chunk is even missing, too.
This is just an example and similarly-crafted ASCONF chunks
could be used just as well.

The ASCONF chunk passes through sctp_verify_asconf() as all
parameters passed sanity checks, and after walking, we ended
up successfully at the chunk end boundary, and thus may invoke
sctp_process_asconf(). Parameter walking is done with
WORD_ROUND() to take padding into account.

In sctp_process_asconf()'s TLV processing, we may fail in
sctp_process_asconf_param() e.g., due to removal of the IP
address that is also the source address of the packet containing
the ASCONF chunk, and thus we need to add all TLVs after the
failure to our ASCONF response to remote via helper function
sctp_add_asconf_response(), which basically invokes a
sctp_addto_chunk() adding the error parameters to the given
skb.

When walking to the next parameter this time, we proceed
with ...

  length = ntohs(asconf_param->param_hdr.length);
  asconf_param = (void *)asconf_param + length;

... instead of the WORD_ROUND()'ed length, thus resulting here
in an off-by-one that leads to reading the follow-up garbage
parameter length of 12336, and thus throwing an skb_over_panic
for the reply when trying to sctp_addto_chunk() next time,
which implicitly calls the skb_put() with that length.

Fix it by using sctp_walk_params() [ which is also used in
INIT parameter processing ] macro in the verification *and*
in ASCONF processing: it will make sure we don't spill over,
that we walk parameters WORD_ROUND()'ed. Moreover, we're being
more defensive and guard against unknown parameter types and
missized addresses.

Joint work with Vlad Yasevich.

Fixes: b896b82be4ae ("[SCTP] ADDIP: Support for processing incoming ASCONF_ACK chunks.")
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: Vlad Yasevich <vyasevich@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Cc: Josh Boyer <jwboyer@fedoraproject.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-21 09:23:06 -08:00
Daniel Borkmann
59ea8663e3 net: sctp: fix panic on duplicate ASCONF chunks
commit b69040d8e3 upstream.

When receiving a e.g. semi-good formed connection scan in the
form of ...

  -------------- INIT[ASCONF; ASCONF_ACK] ------------->
  <----------- INIT-ACK[ASCONF; ASCONF_ACK] ------------
  -------------------- COOKIE-ECHO -------------------->
  <-------------------- COOKIE-ACK ---------------------
  ---------------- ASCONF_a; ASCONF_b ----------------->

... where ASCONF_a equals ASCONF_b chunk (at least both serials
need to be equal), we panic an SCTP server!

The problem is that good-formed ASCONF chunks that we reply with
ASCONF_ACK chunks are cached per serial. Thus, when we receive a
same ASCONF chunk twice (e.g. through a lost ASCONF_ACK), we do
not need to process them again on the server side (that was the
idea, also proposed in the RFC). Instead, we know it was cached
and we just resend the cached chunk instead. So far, so good.

Where things get nasty is in SCTP's side effect interpreter, that
is, sctp_cmd_interpreter():

While incoming ASCONF_a (chunk = event_arg) is being marked
!end_of_packet and !singleton, and we have an association context,
we do not flush the outqueue the first time after processing the
ASCONF_ACK singleton chunk via SCTP_CMD_REPLY. Instead, we keep it
queued up, although we set local_cork to 1. Commit 2e3216cd54
changed the precedence, so that as long as we get bundled, incoming
chunks we try possible bundling on outgoing queue as well. Before
this commit, we would just flush the output queue.

Now, while ASCONF_a's ASCONF_ACK sits in the corked outq, we
continue to process the same ASCONF_b chunk from the packet. As
we have cached the previous ASCONF_ACK, we find it, grab it and
do another SCTP_CMD_REPLY command on it. So, effectively, we rip
the chunk->list pointers and requeue the same ASCONF_ACK chunk
another time. Since we process ASCONF_b, it's correctly marked
with end_of_packet and we enforce an uncork, and thus flush, thus
crashing the kernel.

Fix it by testing if the ASCONF_ACK is currently pending and if
that is the case, do not requeue it. When flushing the output
queue we may relink the chunk for preparing an outgoing packet,
but eventually unlink it when it's copied into the skb right
before transmission.

Joint work with Vlad Yasevich.

Fixes: 2e3216cd54 ("sctp: Follow security requirement of responding with 1 packet")
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: Vlad Yasevich <vyasevich@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Cc: Josh Boyer <jwboyer@fedoraproject.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-21 09:23:05 -08:00