Commit Graph

5177 Commits

Author SHA1 Message Date
Andrea Arcangeli
3ddb5b56f0 mm: thp: fix BUG on mm->nr_ptes
commit 1c641e8471 upstream.

Dave Jones reports a few Fedora users hitting the BUG_ON(mm->nr_ptes...)
in exit_mmap() recently.

Quoting Hugh's discovery and explanation of the SMP race condition:

  "mm->nr_ptes had unusual locking: down_read mmap_sem plus
   page_table_lock when incrementing, down_write mmap_sem (or mm_users
   0) when decrementing; whereas THP is careful to increment and
   decrement it under page_table_lock.

   Now most of those paths in THP also hold mmap_sem for read or write
   (with appropriate checks on mm_users), but two do not: when
   split_huge_page() is called by hwpoison_user_mappings(), and when
   called by add_to_swap().

   It's conceivable that the latter case is responsible for the
   exit_mmap() BUG_ON mm->nr_ptes that has been reported on Fedora."

The simplest way to fix it without having to alter the locking is to make
split_huge_page() a noop in nr_ptes terms, so by counting the preallocated
pagetables that exists for every mapped hugepage.  It was an arbitrary
choice not to count them and either way is not wrong or right, because
they are not used but they're still allocated.

Reported-by: Dave Jones <davej@redhat.com>
Reported-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Josh Boyer <jwboyer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-03-12 10:32:56 -07:00
David Howells
0f062a5c0e NOMMU: Don't need to clear vm_mm when deleting a VMA
commit b94cfaf668 upstream.

Don't clear vm_mm in a deleted VMA as it's unnecessary and might
conceivably break the filesystem or driver VMA close routine.

Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-03-12 10:32:56 -07:00
Anton Vorontsov
509c46f3d4 mm: memcg: Correct unregistring of events attached to the same eventfd
commit 371528caec upstream.

There is an issue when memcg unregisters events that were attached to
the same eventfd:

- On the first call mem_cgroup_usage_unregister_event() removes all
  events attached to a given eventfd, and if there were no events left,
  thresholds->primary would become NULL;

- Since there were several events registered, cgroups core will call
  mem_cgroup_usage_unregister_event() again, but now kernel will oops,
  as the function doesn't expect that threshold->primary may be NULL.

That's a good question whether mem_cgroup_usage_unregister_event()
should actually remove all events in one go, but nowadays it can't
do any better as cftype->unregister_event callback doesn't pass
any private event-associated cookie. So, let's fix the issue by
simply checking for threshold->primary.

FWIW, w/o the patch the following oops may be observed:

 BUG: unable to handle kernel NULL pointer dereference at 0000000000000004
 IP: [<ffffffff810be32c>] mem_cgroup_usage_unregister_event+0x9c/0x1f0
 Pid: 574, comm: kworker/0:2 Not tainted 3.3.0-rc4+ #9 Bochs Bochs
 RIP: 0010:[<ffffffff810be32c>]  [<ffffffff810be32c>] mem_cgroup_usage_unregister_event+0x9c/0x1f0
 RSP: 0018:ffff88001d0b9d60  EFLAGS: 00010246
 Process kworker/0:2 (pid: 574, threadinfo ffff88001d0b8000, task ffff88001de91cc0)
 Call Trace:
  [<ffffffff8107092b>] cgroup_event_remove+0x2b/0x60
  [<ffffffff8103db94>] process_one_work+0x174/0x450
  [<ffffffff8103e413>] worker_thread+0x123/0x2d0

Signed-off-by: Anton Vorontsov <anton.vorontsov@linaro.org>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-03-12 10:32:55 -07:00
David Howells
251ebc7bb4 NOMMU: Lock i_mmap_mutex for access to the VMA prio list
commit 918e556ec2 upstream.

Lock i_mmap_mutex for access to the VMA prio list to prevent concurrent
access.  Currently, certain parts of the mmap handling are protected by
the region mutex, but not all.

Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-02-29 16:33:36 -08:00
Eric Dumazet
16c7560fcc slub: fix a possible memleak in __slab_alloc()
commit 73736e0387 upstream.

Zhihua Che reported a possible memleak in slub allocator on
CONFIG_PREEMPT=y builds.

It is possible current thread migrates right before disabling irqs in
__slab_alloc(). We must check again c->freelist, and perform a normal
allocation instead of scratching c->freelist.

Many thanks to Zhihua Che for spotting this bug, introduced in 2.6.39

V2: Its also possible an IRQ freed one (or several) object(s) and
populated c->freelist, so its not a CONFIG_PREEMPT only problem.

Reported-by: Zhihua Che <zhihua.che@gmail.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-02-20 12:48:14 -08:00
Hugh Dickins
91b08ca08c mm: fix UP THP spin_is_locked BUGs
commit b9980cdcf2 upstream.

Fix CONFIG_TRANSPARENT_HUGEPAGE=y CONFIG_SMP=n CONFIG_DEBUG_VM=y
CONFIG_DEBUG_SPINLOCK=n kernel: spin_is_locked() is then always false,
and so triggers some BUGs in Transparent HugePage codepaths.

asm-generic/bug.h mentions this problem, and provides a WARN_ON_SMP(x);
but being too lazy to add VM_BUG_ON_SMP, BUG_ON_SMP, WARN_ON_SMP_ONCE,
VM_WARN_ON_SMP_ONCE, just test NR_CPUS != 1 in the existing VM_BUG_ONs.

Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-02-13 11:06:11 -08:00
Mel Gorman
b913481230 mm: compaction: check for overlapping nodes during isolation for migration
commit dc9086004b upstream.

When isolating pages for migration, migration starts at the start of a
zone while the free scanner starts at the end of the zone.  Migration
avoids entering a new zone by never going beyond the free scanned.

Unfortunately, in very rare cases nodes can overlap.  When this happens,
migration isolates pages without the LRU lock held, corrupting lists
which will trigger errors in reclaim or during page free such as in the
following oops

  BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
  IP: [<ffffffff810f795c>] free_pcppages_bulk+0xcc/0x450
  PGD 1dda554067 PUD 1e1cb58067 PMD 0
  Oops: 0000 [#1] SMP
  CPU 37
  Pid: 17088, comm: memcg_process_s Tainted: G            X
  RIP: free_pcppages_bulk+0xcc/0x450
  Process memcg_process_s (pid: 17088, threadinfo ffff881c2926e000, task ffff881c2926c0c0)
  Call Trace:
    free_hot_cold_page+0x17e/0x1f0
    __pagevec_free+0x90/0xb0
    release_pages+0x22a/0x260
    pagevec_lru_move_fn+0xf3/0x110
    putback_lru_page+0x66/0xe0
    unmap_and_move+0x156/0x180
    migrate_pages+0x9e/0x1b0
    compact_zone+0x1f3/0x2f0
    compact_zone_order+0xa2/0xe0
    try_to_compact_pages+0xdf/0x110
    __alloc_pages_direct_compact+0xee/0x1c0
    __alloc_pages_slowpath+0x370/0x830
    __alloc_pages_nodemask+0x1b1/0x1c0
    alloc_pages_vma+0x9b/0x160
    do_huge_pmd_anonymous_page+0x160/0x270
    do_page_fault+0x207/0x4c0
    page_fault+0x25/0x30

The "X" in the taint flag means that external modules were loaded but but
is unrelated to the bug triggering.  The real problem was because the PFN
layout looks like this

  Zone PFN ranges:
    DMA      0x00000010 -> 0x00001000
    DMA32    0x00001000 -> 0x00100000
    Normal   0x00100000 -> 0x01e80000
  Movable zone start PFN for each node
  early_node_map[14] active PFN ranges
      0: 0x00000010 -> 0x0000009b
      0: 0x00000100 -> 0x0007a1ec
      0: 0x0007a354 -> 0x0007a379
      0: 0x0007f7ff -> 0x0007f800
      0: 0x00100000 -> 0x00680000
      1: 0x00680000 -> 0x00e80000
      0: 0x00e80000 -> 0x01080000
      1: 0x01080000 -> 0x01280000
      0: 0x01280000 -> 0x01480000
      1: 0x01480000 -> 0x01680000
      0: 0x01680000 -> 0x01880000
      1: 0x01880000 -> 0x01a80000
      0: 0x01a80000 -> 0x01c80000
      1: 0x01c80000 -> 0x01e80000

The fix is straight-forward.  isolate_migratepages() has to make a
similar check to isolate_freepage to ensure that it never isolates pages
from a zone it does not hold the LRU lock for.

This was discovered in a 3.0-based kernel but it affects 3.1.x, 3.2.x
and current mainline.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Michal Nazarewicz <mina86@mina86.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-02-13 11:06:11 -08:00
Mel Gorman
77650df5a7 mm: compaction: check pfn_valid when entering a new MAX_ORDER_NR_PAGES block during isolation for migration
commit 0bf380bc70 upstream.

When isolating for migration, migration starts at the start of a zone
which is not necessarily pageblock aligned.  Further, it stops isolating
when COMPACT_CLUSTER_MAX pages are isolated so migrate_pfn is generally
not aligned.  This allows isolate_migratepages() to call pfn_to_page() on
an invalid PFN which can result in a crash.  This was originally reported
against a 3.0-based kernel with the following trace in a crash dump.

PID: 9902   TASK: d47aecd0  CPU: 0   COMMAND: "memcg_process_s"
 #0 [d72d3ad0] crash_kexec at c028cfdb
 #1 [d72d3b24] oops_end at c05c5322
 #2 [d72d3b38] __bad_area_nosemaphore at c0227e60
 #3 [d72d3bec] bad_area at c0227fb6
 #4 [d72d3c00] do_page_fault at c05c72ec
 #5 [d72d3c80] error_code (via page_fault) at c05c47a4
    EAX: 00000000  EBX: 000c0000  ECX: 00000001  EDX: 00000807  EBP: 000c0000
    DS:  007b      ESI: 00000001  ES:  007b      EDI: f3000a80  GS:  6f50
    CS:  0060      EIP: c030b15a  ERR: ffffffff  EFLAGS: 00010002
 #6 [d72d3cb4] isolate_migratepages at c030b15a
 #7 [d72d3d14] zone_watermark_ok at c02d26cb
 #8 [d72d3d2c] compact_zone at c030b8de
 #9 [d72d3d68] compact_zone_order at c030bba1
#10 [d72d3db4] try_to_compact_pages at c030bc84
#11 [d72d3ddc] __alloc_pages_direct_compact at c02d61e7
#12 [d72d3e08] __alloc_pages_slowpath at c02d66c7
#13 [d72d3e78] __alloc_pages_nodemask at c02d6a97
#14 [d72d3eb8] alloc_pages_vma at c030a845
#15 [d72d3ed4] do_huge_pmd_anonymous_page at c03178eb
#16 [d72d3f00] handle_mm_fault at c02f36c6
#17 [d72d3f30] do_page_fault at c05c70ed
#18 [d72d3fb0] error_code (via page_fault) at c05c47a4
    EAX: b71ff000  EBX: 00000001  ECX: 00001600  EDX: 00000431
    DS:  007b      ESI: 08048950  ES:  007b      EDI: bfaa3788
    SS:  007b      ESP: bfaa36e0  EBP: bfaa3828  GS:  6f50
    CS:  0073      EIP: 080487c8  ERR: ffffffff  EFLAGS: 00010202

It was also reported by Herbert van den Bergh against 3.1-based kernel
with the following snippet from the console log.

BUG: unable to handle kernel paging request at 01c00008
IP: [<c0522399>] isolate_migratepages+0x119/0x390
*pdpt = 000000002f7ce001 *pde = 0000000000000000

It is expected that it also affects 3.2.x and current mainline.

The problem is that pfn_valid is only called on the first PFN being
checked and that PFN is not necessarily aligned.  Lets say we have a case
like this

H = MAX_ORDER_NR_PAGES boundary
| = pageblock boundary
m = cc->migrate_pfn
f = cc->free_pfn
o = memory hole

H------|------H------|----m-Hoooooo|ooooooH-f----|------H

The migrate_pfn is just below a memory hole and the free scanner is beyond
the hole.  When isolate_migratepages started, it scans from migrate_pfn to
migrate_pfn+pageblock_nr_pages which is now in a memory hole.  It checks
pfn_valid() on the first PFN but then scans into the hole where there are
not necessarily valid struct pages.

This patch ensures that isolate_migratepages calls pfn_valid when
necessary.

Reported-by: Herbert van den Bergh <herbert.van.den.bergh@oracle.com>
Tested-by: Herbert van den Bergh <herbert.van.den.bergh@oracle.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Michal Nazarewicz <mina86@mina86.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-02-13 11:06:07 -08:00
Carsten Otte
c713decb17 mm/filemap_xip.c: fix race condition in xip_file_fault()
commit 99f02ef1f1 upstream.

Fix a race condition that shows in conjunction with xip_file_fault() when
two threads of the same user process fault on the same memory page.

In this case, the race winner will install the page table entry and the
unlucky loser will cause an oops: xip_file_fault calls vm_insert_pfn (via
vm_insert_mixed) which drops out at this check:

	retval = -EBUSY;
	if (!pte_none(*pte))
		goto out_unlock;

The resulting -EBUSY return value will trigger a BUG_ON() in
xip_file_fault.

This fix simply considers the fault as fixed in this case, because the
race winner has successfully installed the pte.

[akpm@linux-foundation.org: use conventional (and consistent) comment layout]
Reported-by: David Sadler <dsadler@us.ibm.com>
Signed-off-by: Carsten Otte <cotte@de.ibm.com>
Reported-by: Louis Alex Eisner <leisner@cs.ucsd.edu>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-02-13 11:06:07 -08:00
Shaohua Li
99067d6916 readahead: fix pipeline break caused by block plug
commit 3deaa7190a upstream.

Herbert Poetzl reported a performance regression since 2.6.39.  The test
is a simple dd read, but with big block size.  The reason is:

T1: ra (A, A+128k), (A+128k, A+256k)
T2: lock_page for page A, submit the 256k
T3: hit page A+128K, ra (A+256k, A+384). the range isn't submitted
because of plug and there isn't any lock_page till we hit page A+256k
because all pages from A to A+256k is in memory
T4: hit page A+256k, ra (A+384, A+ 512). Because of plug, the range isn't
submitted again.
T5: lock_page A+256k, so (A+256k, A+512k) will be submitted. The task is
waitting for (A+256k, A+512k) finish.

There is no request to disk in T3 and T4, so readahead pipeline breaks.

We really don't need block plug for generic_file_aio_read() for buffered
I/O.  The readahead already has plug and has fine grained control when I/O
should be submitted.  Deleting plug for buffered I/O fixes the regression.

One side effect is plug makes the request size 256k, the size is 128k
without it.  This is because default ra size is 128k and not a reason we
need plug here.

Vivek said:

: We submit some readahead IO to device request queue but because of nested
: plug, queue never gets unplugged.  When read logic reaches a page which is
: not in page cache, it waits for page to be read from the disk
: (lock_page_killable()) and that time we flush the plug list.
:
: So effectively read ahead logic is kind of broken in parts because of
: nested plugging.  Removing top level plug (generic_file_aio_read()) for
: buffered reads, will allow unplugging queue earlier for readahead.

Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Reported-by: Herbert Poetzl <herbert@13thfloor.at>
Tested-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-02-13 11:06:04 -08:00
Michal Hocko
bcbef18db9 mm: fix NULL ptr dereference in __count_immobile_pages
commit 687875fb7d upstream.

Fix the following NULL ptr dereference caused by

  cat /sys/devices/system/memory/memory0/removable

Pid: 13979, comm: sed Not tainted 3.0.13-0.5-default #1 IBM BladeCenter LS21 -[7971PAM]-/Server Blade
RIP: __count_immobile_pages+0x4/0x100
Process sed (pid: 13979, threadinfo ffff880221c36000, task ffff88022e788480)
Call Trace:
  is_pageblock_removable_nolock+0x34/0x40
  is_mem_section_removable+0x74/0xf0
  show_mem_removable+0x41/0x70
  sysfs_read_file+0xfe/0x1c0
  vfs_read+0xc7/0x130
  sys_read+0x53/0xa0
  system_call_fastpath+0x16/0x1b

We are crashing because we are trying to dereference NULL zone which
came from pfn=0 (struct page ffffea0000000000). According to the boot
log this page is marked reserved:
e820 update range: 0000000000000000 - 0000000000010000 (usable) ==> (reserved)

and early_node_map confirms that:
early_node_map[3] active PFN ranges
    1: 0x00000010 -> 0x0000009c
    1: 0x00000100 -> 0x000bffa3
    1: 0x00100000 -> 0x00240000

The problem is that memory_present works in PAGE_SECTION_MASK aligned
blocks so the reserved range sneaks into the the section as well.  This
also means that free_area_init_node will not take care of those reserved
pages and they stay uninitialized.

When we try to read the removable status we walk through all available
sections and hope that the zone is valid for all pages in the section.
But this is not true in this case as the zone and nid are not initialized.

We have only one node in this particular case and it is marked as node=1
(rather than 0) and that made the problem visible because page_to_nid will
return 0 and there are no zones on the node.

Let's check that the zone is valid and that the given pfn falls into its
boundaries and mark the section not removable.  This might cause some
false positives, probably, but we do not have any sane way to find out
whether the page is reserved by the platform or it is just not used for
whatever other reasons.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2012-01-25 17:25:05 -08:00
KAMEZAWA Hiroyuki
ea1c627781 memcg: add mem_cgroup_replace_page_cache() to fix LRU issue
commit ab936cbcd0 upstream.

Commit ef6a3c6311 ("mm: add replace_page_cache_page() function") added a
function replace_page_cache_page().  This function replaces a page in the
radix-tree with a new page.  WHen doing this, memory cgroup needs to fix
up the accounting information.  memcg need to check PCG_USED bit etc.

In some(many?) cases, 'newpage' is on LRU before calling
replace_page_cache().  So, memcg's LRU accounting information should be
fixed, too.

This patch adds mem_cgroup_replace_page_cache() and removes the old hooks.
 In that function, old pages will be unaccounted without touching
res_counter and new page will be accounted to the memcg (of old page).
WHen overwriting pc->mem_cgroup of newpage, take zone->lru_lock and avoid
races with LRU handling.

Background:
  replace_page_cache_page() is called by FUSE code in its splice() handling.
  Here, 'newpage' is replacing oldpage but this newpage is not a newly allocated
  page and may be on LRU. LRU mis-accounting will be critical for memory cgroup
  because rmdir() checks the whole LRU is empty and there is no account leak.
  If a page is on the other LRU than it should be, rmdir() will fail.

This bug was added in March 2011, but no bug report yet.  I guess there
are not many people who use memcg and FUSE at the same time with upstream
kernels.

The result of this bug is that admin cannot destroy a memcg because of
account leak.  So, no panic, no deadlock.  And, even if an active cgroup
exist, umount can succseed.  So no problem at shutdown.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Miklos Szeredi <mszeredi@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2012-01-25 17:24:43 -08:00
Hillf Danton
7204bf5ef7 mm: hugetlb: fix non-atomic enqueue of huge page
commit b0365c8d0c upstream.

If a huge page is enqueued under the protection of hugetlb_lock, then the
operation is atomic and safe.

Signed-off-by: Hillf Danton <dhillf@gmail.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2012-01-06 14:14:00 -08:00
Hillf Danton
4875f39a05 memcg: keep root group unchanged if creation fails
commit a41c58a666 upstream.

If the request is to create non-root group and we fail to meet it, we
should leave the root unchanged.

Signed-off-by: Hillf Danton <dhillf@gmail.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2012-01-06 14:13:55 -08:00
Dave Kleikamp
2f79eddd49 vfs: __read_cache_page should use gfp argument rather than GFP_KERNEL
commit e6f67b8c05 upstream.

lockdep reports a deadlock in jfs because a special inode's rw semaphore
is taken recursively.  The mapping's gfp mask is GFP_NOFS, but is not
used when __read_cache_page() calls add_to_page_cache_lru().

Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
Acked-by: Hugh Dickins <hughd@google.com>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2012-01-06 14:13:54 -08:00
Frantisek Hrbata
724c6ae291 oom: fix integer overflow of points in oom_badness
commit ff05b6f7ae upstream.

An integer overflow will happen on 64bit archs if task's sum of rss,
swapents and nr_ptes exceeds (2^31)/1000 value.  This was introduced by
commit

f755a04 oom: use pte pages in OOM score

where the oom score computation was divided into several steps and it's no
longer computed as one expression in unsigned long(rss, swapents, nr_pte
are unsigned long), where the result value assigned to points(int) is in
range(1..1000).  So there could be an int overflow while computing

176          points *= 1000;

and points may have negative value. Meaning the oom score for a mem hog task
will be one.

196          if (points <= 0)
197                  return 1;

For example:
[ 3366]     0  3366 35390480 24303939   5       0             0 oom01
Out of memory: Kill process 3366 (oom01) score 1 or sacrifice child

Here the oom1 process consumes more than 24303939(rss)*4096~=92GB physical
memory, but it's oom score is one.

In this situation the mem hog task is skipped and oom killer kills another and
most probably innocent task with oom score greater than one.

The points variable should be of type long instead of int to prevent the
int overflow.

Signed-off-by: Frantisek Hrbata <fhrbata@redhat.com>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2012-01-06 14:13:51 -08:00
Eugene Surovegin
d051424f29 percpu: fix per_cpu_ptr_to_phys() handling of non-page-aligned addresses
commit 9f57bd4d6d upstream.

per_cpu_ptr_to_phys() incorrectly rounds up its result for non-kmalloc
case to the page boundary, which is bogus for any non-page-aligned
address.

This affects the only in-tree user of this function - sysfs handler
for per-cpu 'crash_notes' physical address.  The trouble is that the
crash_notes per-cpu variable is not page-aligned:

crash_notes = 0xc08e8ed4
PER-CPU OFFSET VALUES:
 CPU 0: 3711f000
 CPU 1: 37129000
 CPU 2: 37133000
 CPU 3: 3713d000

So, the per-cpu addresses are:
 crash_notes on CPU 0: f7a07ed4 => phys 36b57ed4
 crash_notes on CPU 1: f7a11ed4 => phys 36b4ded4
 crash_notes on CPU 2: f7a1bed4 => phys 36b43ed4
 crash_notes on CPU 3: f7a25ed4 => phys 36b39ed4

However, /sys/devices/system/cpu/cpu*/crash_notes says:
 /sys/devices/system/cpu/cpu0/crash_notes: 36b57000
 /sys/devices/system/cpu/cpu1/crash_notes: 36b4d000
 /sys/devices/system/cpu/cpu2/crash_notes: 36b43000
 /sys/devices/system/cpu/cpu3/crash_notes: 36b39000

As you can see, all values are rounded down to a page
boundary. Consequently, this is where kexec sets up the NOTE segments,
and thus where the secondary kernel is looking for them. However, when
the first kernel crashes, it saves the notes to the unaligned
addresses, where they are not found.

Fix it by adding offset_in_page() to the translated page address.

-tj: Combined Eugene's and Petr's commit messages.

Signed-off-by: Eugene Surovegin <ebs@ebshome.net>
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Petr Tesarik <ptesarik@suse.cz>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2012-01-06 14:13:50 -08:00
Tejun Heo
6400c8a382 percpu: fix chunk range calculation
commit a855b84c3d upstream.

Percpu allocator recorded the cpus which map to the first and last
units in pcpu_first/last_unit_cpu respectively and used them to
determine the address range of a chunk - e.g. it assumed that the
first unit has the lowest address in a chunk while the last unit has
the highest address.

This simply isn't true.  Groups in a chunk can have arbitrary positive
or negative offsets from the previous one and there is no guarantee
that the first unit occupies the lowest offset while the last one the
highest.

Fix it by actually comparing unit offsets to determine cpus occupying
the lowest and highest offsets.  Also, rename pcu_first/last_unit_cpu
to pcpu_low/high_unit_cpu to avoid confusion.

The chunk address range is used to flush cache on vmalloc area
map/unmap and decide whether a given address is in the first chunk by
per_cpu_ptr_to_phys() and the bug was discovered by invalid
per_cpu_ptr_to_phys() translation for crash_note.

Kudos to Dave Young for tracking down the problem.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: WANG Cong <xiyou.wangcong@gmail.com>
Reported-by: Dave Young <dyoung@redhat.com>
Tested-by: Dave Young <dyoung@redhat.com>
LKML-Reference: <4EC21F67.10905@redhat.com>
Signed-off-by: Thomas Renninger <trenn@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-12-21 12:57:37 -08:00
Mel Gorman
3a15d7377f mm: vmalloc: check for page allocation failure before vmlist insertion
commit 1368edf064 upstream.

Commit f5252e00 ("mm: avoid null pointer access in vm_struct via
/proc/vmallocinfo") adds newly allocated vm_structs to the vmlist after
it is fully initialised.  Unfortunately, it did not check that
__vmalloc_area_node() successfully populated the area.  In the event of
allocation failure, the vmalloc area is freed but the pointer to freed
memory is inserted into the vmlist leading to a a crash later in
get_vmalloc_info().

This patch adds a check for ____vmalloc_area_node() failure within
__vmalloc_node_range.  It does not use "goto fail" as in the previous
error path as a warning was already displayed by __vmalloc_area_node()
before it called vfree in its failure path.

Credit goes to Luciano Chavez for doing all the real work of identifying
exactly where the problem was.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reported-by: Luciano Chavez <lnx1138@linux.vnet.ibm.com>
Tested-by: Luciano Chavez <lnx1138@linux.vnet.ibm.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-12-21 12:57:36 -08:00
Michal Hocko
21830d75b1 mm: Ensure that pfn_valid() is called once per pageblock when reserving pageblocks
commit d021563888 upstream.

setup_zone_migrate_reserve() expects that zone->start_pfn starts at
pageblock_nr_pages aligned pfn otherwise we could access beyond an
existing memblock resulting in the following panic if
CONFIG_HOLES_IN_ZONE is not configured and we do not check pfn_valid:

  IP: [<c02d331d>] setup_zone_migrate_reserve+0xcd/0x180
  *pdpt = 0000000000000000 *pde = f000ff53f000ff53
  Oops: 0000 [#1] SMP
  Pid: 1, comm: swapper Not tainted 3.0.7-0.7-pae #1 VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform
  EIP: 0060:[<c02d331d>] EFLAGS: 00010006 CPU: 0
  EIP is at setup_zone_migrate_reserve+0xcd/0x180
  EAX: 000c0000 EBX: f5801fc0 ECX: 000c0000 EDX: 00000000
  ESI: 000c01fe EDI: 000c01fe EBP: 00140000 ESP: f2475f58
  DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
  Process swapper (pid: 1, ti=f2474000 task=f2472cd0 task.ti=f2474000)
  Call Trace:
  [<c02d389c>] __setup_per_zone_wmarks+0xec/0x160
  [<c02d3a1f>] setup_per_zone_wmarks+0xf/0x20
  [<c08a771c>] init_per_zone_wmark_min+0x27/0x86
  [<c020111b>] do_one_initcall+0x2b/0x160
  [<c086639d>] kernel_init+0xbe/0x157
  [<c05cae26>] kernel_thread_helper+0x6/0xd
  Code: a5 39 f5 89 f7 0f 46 fd 39 cf 76 40 8b 03 f6 c4 08 74 32 eb 91 90 89 c8 c1 e8 0e 0f be 80 80 2f 86 c0 8b 14 85 60 2f 86 c0 89 c8 <2b> 82 b4 12 00 00 c1 e0 05 03 82 ac 12 00 00 8b 00 f6 c4 08 0f
  EIP: [<c02d331d>] setup_zone_migrate_reserve+0xcd/0x180 SS:ESP 0068:f2475f58
  CR2: 00000000000012b4

We crashed in pageblock_is_reserved() when accessing pfn 0xc0000 because
highstart_pfn = 0x36ffe.

The issue was introduced in 3.0-rc1 by 6d3163ce ("mm: check if any page
in a pageblock is reserved before marking it MIGRATE_RESERVE").

Make sure that start_pfn is always aligned to pageblock_nr_pages to
ensure that pfn_valid s always called at the start of each pageblock.
Architectures with holes in pageblocks will be correctly handled by
pfn_valid_within in pageblock_is_reserved.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Tested-by: Dang Bo <bdang@vmware.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Arve Hjnnevg <arve@android.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-12-21 12:57:36 -08:00
Youquan Song
b492a377ac thp: set compound tail page _count to zero
commit 58a84aa927 upstream.

Commit 70b50f94f1 ("mm: thp: tail page refcounting fix") keeps all
page_tail->_count zero at all times.  But the current kernel does not
set page_tail->_count to zero if a 1GB page is utilized.  So when an
IOMMU 1GB page is used by KVM, it wil result in a kernel oops because a
tail page's _count does not equal zero.

  kernel BUG at include/linux/mm.h:386!
  invalid opcode: 0000 [#1] SMP
  Call Trace:
    gup_pud_range+0xb8/0x19d
    get_user_pages_fast+0xcb/0x192
    ? trace_hardirqs_off+0xd/0xf
    hva_to_pfn+0x119/0x2f2
    gfn_to_pfn_memslot+0x2c/0x2e
    kvm_iommu_map_pages+0xfd/0x1c1
    kvm_iommu_map_memslots+0x7c/0xbd
    kvm_iommu_map_guest+0xaa/0xbf
    kvm_vm_ioctl_assigned_device+0x2ef/0xa47
    kvm_vm_ioctl+0x36c/0x3a2
    do_vfs_ioctl+0x49e/0x4e4
    sys_ioctl+0x5a/0x7c
    system_call_fastpath+0x16/0x1b
  RIP  gup_huge_pud+0xf2/0x159

Signed-off-by: Youquan Song <youquan.song@intel.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-12-21 12:57:35 -08:00
Hillf Danton
d98627dd4a hugetlb: release pages in the error path of hugetlb_cow()
commit ea4039a34c upstream.

If we fail to prepare an anon_vma, the {new, old}_page should be released,
or they will leak.

Signed-off-by: Hillf Danton <dhillf@gmail.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <jweiner@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-12-09 08:52:37 -08:00
Rabin Vincent
1f36bbbe41 backing-dev: ensure wakeup_timer is deleted
commit 7a401a972d upstream.

bdi_prune_sb() in bdi_unregister() attempts to removes the bdi links
from all super_blocks and then del_timer_sync() the writeback timer.

However, this can race with __mark_inode_dirty(), leading to
bdi_wakeup_thread_delayed() rearming the writeback timer on the bdi
we're unregistering, after we've called del_timer_sync().

This can end up with the bdi being freed with an active timer inside it,
as in the case of the following dump after the removal of an SD card.

Fix this by redoing the del_timer_sync() in bdi_destory().

 ------------[ cut here ]------------
 WARNING: at /home/rabin/kernel/arm/lib/debugobjects.c:262 debug_print_object+0x9c/0xc8()
 ODEBUG: free active (active state 0) object type: timer_list hint: wakeup_timer_fn+0x0/0x180
 Modules linked in:
 Backtrace:
 [<c00109dc>] (dump_backtrace+0x0/0x110) from [<c0236e4c>] (dump_stack+0x18/0x1c)
  r6:c02bc638 r5:00000106 r4:c79f5d18 r3:00000000
 [<c0236e34>] (dump_stack+0x0/0x1c) from [<c0025e6c>] (warn_slowpath_common+0x54/0x6c)
 [<c0025e18>] (warn_slowpath_common+0x0/0x6c) from [<c0025f28>] (warn_slowpath_fmt+0x38/0x40)
  r8:20000013 r7:c780c6f0 r6:c031613c r5:c780c6f0 r4:c02b1b29
 r3:00000009
 [<c0025ef0>] (warn_slowpath_fmt+0x0/0x40) from [<c015eb4c>] (debug_print_object+0x9c/0xc8)
  r3:c02b1b29 r2:c02bc662
 [<c015eab0>] (debug_print_object+0x0/0xc8) from [<c015f574>] (debug_check_no_obj_freed+0xac/0x1dc)
  r6:c7964000 r5:00000001 r4:c7964000
 [<c015f4c8>] (debug_check_no_obj_freed+0x0/0x1dc) from [<c00a9e38>] (kmem_cache_free+0x88/0x1f8)
 [<c00a9db0>] (kmem_cache_free+0x0/0x1f8) from [<c014286c>] (blk_release_queue+0x70/0x78)
 [<c01427fc>] (blk_release_queue+0x0/0x78) from [<c015290c>] (kobject_release+0x70/0x84)
  r5:c79641f0 r4:c796420c
 [<c015289c>] (kobject_release+0x0/0x84) from [<c0153ce4>] (kref_put+0x68/0x80)
  r7:00000083 r6:c74083d0 r5:c015289c r4:c796420c
 [<c0153c7c>] (kref_put+0x0/0x80) from [<c01527d0>] (kobject_put+0x48/0x5c)
  r5:c79643b4 r4:c79641f0
 [<c0152788>] (kobject_put+0x0/0x5c) from [<c013ddd8>] (blk_cleanup_queue+0x68/0x74)
  r4:c7964000
 [<c013dd70>] (blk_cleanup_queue+0x0/0x74) from [<c01a6370>] (mmc_blk_put+0x78/0xe8)
  r5:00000000 r4:c794c400
 [<c01a62f8>] (mmc_blk_put+0x0/0xe8) from [<c01a64b4>] (mmc_blk_release+0x24/0x38)
  r5:c794c400 r4:c0322824
 [<c01a6490>] (mmc_blk_release+0x0/0x38) from [<c00de11c>] (__blkdev_put+0xe8/0x170)
  r5:c78d5e00 r4:c74083c0
 [<c00de034>] (__blkdev_put+0x0/0x170) from [<c00de2c0>] (blkdev_put+0x11c/0x12c)
  r8:c79f5f70 r7:00000001 r6:c74083d0 r5:00000083 r4:c74083c0
 r3:00000000
 [<c00de1a4>] (blkdev_put+0x0/0x12c) from [<c00b0724>] (kill_block_super+0x60/0x6c)
  r7:c7942300 r6:c79f4000 r5:00000083 r4:c74083c0
 [<c00b06c4>] (kill_block_super+0x0/0x6c) from [<c00b0a94>] (deactivate_locked_super+0x44/0x70)
  r6:c79f4000 r5:c031af64 r4:c794dc00 r3:c00b06c4
 [<c00b0a50>] (deactivate_locked_super+0x0/0x70) from [<c00b1358>] (deactivate_super+0x6c/0x70)
  r5:c794dc00 r4:c794dc00
 [<c00b12ec>] (deactivate_super+0x0/0x70) from [<c00c88b0>] (mntput_no_expire+0x188/0x194)
  r5:c794dc00 r4:c7942300
 [<c00c8728>] (mntput_no_expire+0x0/0x194) from [<c00c95e0>] (sys_umount+0x2e4/0x310)
  r6:c7942300 r5:00000000 r4:00000000 r3:00000000
 [<c00c92fc>] (sys_umount+0x0/0x310) from [<c000d940>] (ret_fast_syscall+0x0/0x30)
 ---[ end trace e5c83c92ada51c76 ]---

Signed-off-by: Rabin Vincent <rabin.vincent@stericsson.com>
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-11-21 14:31:25 -08:00
Andrea Arcangeli
68fe9d9c79 mm: thp: tail page refcounting fix
commit 70b50f94f1 upstream.

Michel while working on the working set estimation code, noticed that
calling get_page_unless_zero() on a random pfn_to_page(random_pfn)
wasn't safe, if the pfn ended up being a tail page of a transparent
hugepage under splitting by __split_huge_page_refcount().

He then found the problem could also theoretically materialize with
page_cache_get_speculative() during the speculative radix tree lookups
that uses get_page_unless_zero() in SMP if the radix tree page is freed
and reallocated and get_user_pages is called on it before
page_cache_get_speculative has a chance to call get_page_unless_zero().

So the best way to fix the problem is to keep page_tail->_count zero at
all times.  This will guarantee that get_page_unless_zero() can never
succeed on any tail page.  page_tail->_mapcount is guaranteed zero and
is unused for all tail pages of a compound page, so we can simply
account the tail page references there and transfer them to
tail_page->_count in __split_huge_page_refcount() (in addition to the
head_page->_mapcount).

While debugging this s/_count/_mapcount/ change I also noticed get_page is
called by direct-io.c on pages returned by get_user_pages.  That wasn't
entirely safe because the two atomic_inc in get_page weren't atomic.  As
opposed to other get_user_page users like secondary-MMU page fault to
establish the shadow pagetables would never call any superflous get_page
after get_user_page returns.  It's safer to make get_page universally safe
for tail pages and to use get_page_foll() within follow_page (inside
get_user_pages()).  get_page_foll() is safe to do the refcounting for tail
pages without taking any locks because it is run within PT lock protected
critical sections (PT lock for pte and page_table_lock for
pmd_trans_huge).

The standard get_page() as invoked by direct-io instead will now take
the compound_lock but still only for tail pages.  The direct-io paths
are usually I/O bound and the compound_lock is per THP so very
finegrined, so there's no risk of scalability issues with it.  A simple
direct-io benchmarks with all lockdep prove locking and spinlock
debugging infrastructure enabled shows identical performance and no
overhead.  So it's worth it.  Ideally direct-io should stop calling
get_page() on pages returned by get_user_pages().  The spinlock in
get_page() is already optimized away for no-THP builds but doing
get_page() on tail pages returned by GUP is generally a rare operation
and usually only run in I/O paths.

This new refcounting on page_tail->_mapcount in addition to avoiding new
RCU critical sections will also allow the working set estimation code to
work without any further complexity associated to the tail page
refcounting with THP.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reported-by: Michel Lespinasse <walken@google.com>
Reviewed-by: Michel Lespinasse <walken@google.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <jweiner@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-11-11 09:36:29 -08:00
Mitsuo Hayasaka
38af8ddb4b mm: avoid null pointer access in vm_struct via /proc/vmallocinfo
commit f5252e009d upstream.

The /proc/vmallocinfo shows information about vmalloc allocations in
vmlist that is a linklist of vm_struct.  It, however, may access pages
field of vm_struct where a page was not allocated.  This results in a null
pointer access and leads to a kernel panic.

Why this happens: In __vmalloc_node_range() called from vmalloc(), newly
allocated vm_struct is added to vmlist at __get_vm_area_node() and then,
some fields of vm_struct such as nr_pages and pages are set at
__vmalloc_area_node().  In other words, it is added to vmlist before it is
fully initialized.  At the same time, when the /proc/vmallocinfo is read,
it accesses the pages field of vm_struct according to the nr_pages field
at show_numa_info().  Thus, a null pointer access happens.

The patch adds the newly allocated vm_struct to the vmlist *after* it is
fully initialized.  So, it can avoid accessing the pages field with
unallocated page when show_numa_info() is called.

Signed-off-by: Mitsuo Hayasaka <mitsuo.hayasaka.hu@hitachi.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Namhyung Kim <namhyung@gmail.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-11-11 09:36:18 -08:00
Hugh Dickins
6b2f66dc9c mm: fix race between mremap and removing migration entry
commit 486cf46f3f upstream.

I don't usually pay much attention to the stale "? " addresses in
stack backtraces, but this lucky report from Pawel Sikora hints that
mremap's move_ptes() has inadequate locking against page migration.

 3.0 BUG_ON(!PageLocked(p)) in migration_entry_to_page():
 kernel BUG at include/linux/swapops.h:105!
 RIP: 0010:[<ffffffff81127b76>]  [<ffffffff81127b76>]
                       migration_entry_wait+0x156/0x160
  [<ffffffff811016a1>] handle_pte_fault+0xae1/0xaf0
  [<ffffffff810feee2>] ? __pte_alloc+0x42/0x120
  [<ffffffff8112c26b>] ? do_huge_pmd_anonymous_page+0xab/0x310
  [<ffffffff81102a31>] handle_mm_fault+0x181/0x310
  [<ffffffff81106097>] ? vma_adjust+0x537/0x570
  [<ffffffff81424bed>] do_page_fault+0x11d/0x4e0
  [<ffffffff81109a05>] ? do_mremap+0x2d5/0x570
  [<ffffffff81421d5f>] page_fault+0x1f/0x30

mremap's down_write of mmap_sem, together with i_mmap_mutex or lock,
and pagetable locks, were good enough before page migration (with its
requirement that every migration entry be found) came in, and enough
while migration always held mmap_sem; but not enough nowadays, when
there's memory hotremove and compaction.

The danger is that move_ptes() lets a migration entry dodge around
behind remove_migration_pte()'s back, so it's in the old location when
looking at the new, then in the new location when looking at the old.

Either mremap's move_ptes() must additionally take anon_vma lock(), or
migration's remove_migration_pte() must stop peeking for is_swap_entry()
before it takes pagetable lock.

Consensus chooses the latter: we prefer to add overhead to migration
than to mremapping, which gets used by JVMs and by exec stack setup.

Reported-and-tested-by: Paweł Sikora <pluto@agmk.net>
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-10-25 07:10:17 +02:00
KAMEZAWA Hiroyuki
e19fe03417 memcg: fix vmscan count in small memcgs
commit 4508378b95 upstream.

Commit 246e87a939 ("memcg: fix get_scan_count() for small targets")
fixes the memcg/kswapd behavior against small targets and prevent vmscan
priority too high.

But the implementation is too naive and adds another problem to small
memcg.  It always force scan to 32 pages of file/anon and doesn't handle
swappiness and other rotate_info.  It makes vmscan to scan anon LRU
regardless of swappiness and make reclaim bad.  This patch fixes it by
adjusting scanning count with regard to swappiness at el.

At a test "cat 1G file under 300M limit." (swappiness=20)
 before patch
        scanned_pages_by_limit 360919
        scanned_anon_pages_by_limit 180469
        scanned_file_pages_by_limit 180450
        rotated_pages_by_limit 31
        rotated_anon_pages_by_limit 25
        rotated_file_pages_by_limit 6
        freed_pages_by_limit 180458
        freed_anon_pages_by_limit 19
        freed_file_pages_by_limit 180439
        elapsed_ns_by_limit 429758872
 after patch
        scanned_pages_by_limit 180674
        scanned_anon_pages_by_limit 24
        scanned_file_pages_by_limit 180650
        rotated_pages_by_limit 35
        rotated_anon_pages_by_limit 24
        rotated_file_pages_by_limit 11
        freed_pages_by_limit 180634
        freed_anon_pages_by_limit 0
        freed_file_pages_by_limit 180634
        elapsed_ns_by_limit 367119089
        scanned_pages_by_system 0

the numbers of scanning anon are decreased(as expected), and elapsed time
reduced. By this patch, small memcgs will work better.
(*) Because the amount of file-cache is much bigger than anon,
    recalaim_stat's rotate-scan counter make scanning files more.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Ying Han <yinghan@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-10-03 11:41:08 -07:00
Wu Fengguang
ac693061b1 writeback: introduce .tagged_writepages for the WB_SYNC_NONE sync stage
commit 6e6938b6d3 upstream.

sync(2) is performed in two stages: the WB_SYNC_NONE sync and the
WB_SYNC_ALL sync. Identify the first stage with .tagged_writepages and
do livelock prevention for it, too.

Jan's commit f446daaea9 ("mm: implement writeback livelock avoidance
using page tagging") is a partial fix in that it only fixed the
WB_SYNC_ALL phase livelock.

Although ext4 is tested to no longer livelock with commit f446daaea9,
it may due to some "redirty_tail() after pages_skipped" effect which
is by no means a guarantee for _all_ the file systems.

Note that writeback_inodes_sb() is called by not only sync(), they are
treated the same because the other callers also need livelock prevention.

Impact:  It changes the order in which pages/inodes are synced to disk.
Now in the WB_SYNC_NONE stage, it won't proceed to write the next inode
until finished with the current inode.

Acked-by: Jan Kara <jack@suse.cz>
CC: Dave Chinner <david@fromorbit.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-10-03 11:40:43 -07:00
David Vrabel
d63c8a029e mm: sync vmalloc address space page tables in alloc_vm_area()
commit 461ae488ec upstream.

Xen backend drivers (e.g., blkback and netback) would sometimes fail to
map grant pages into the vmalloc address space allocated with
alloc_vm_area().  The GNTTABOP_map_grant_ref would fail because Xen could
not find the page (in the L2 table) containing the PTEs it needed to
update.

(XEN) mm.c:3846:d0 Could not find L1 PTE for address fbb42000

netback and blkback were making the hypercall from a kernel thread where
task->active_mm != &init_mm and alloc_vm_area() was only updating the page
tables for init_mm.  The usual method of deferring the update to the page
tables of other processes (i.e., after taking a fault) doesn't work as a
fault cannot occur during the hypercall.

This would work on some systems depending on what else was using vmalloc.

Fix this by reverting ef691947d8 ("vmalloc: remove vmalloc_sync_all()
from alloc_vm_area()") and add a comment to explain why it's needed.

Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Cc: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Ian Campbell <Ian.Campbell@citrix.com>
Cc: Keir Fraser <keir.xen@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-10-03 11:40:30 -07:00
Mel Gorman
66d52cb7c4 mm: page allocator: reconsider zones for allocation after direct reclaim
commit 76d3fbf8fb upstream.

With zone_reclaim_mode enabled, it's possible for zones to be considered
full in the zonelist_cache so they are skipped in the future.  If the
process enters direct reclaim, the ZLC may still consider zones to be full
even after reclaiming pages.  Reconsider all zones for allocation if
direct reclaim returns successfully.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Stefan Priebe <s.priebe@profihost.ag>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-10-03 11:40:04 -07:00
Mel Gorman
42274b5f81 mm: page allocator: initialise ZLC for first zone eligible for zone_reclaim
commit cd38b115d5 upstream.

There have been a small number of complaints about significant stalls
while copying large amounts of data on NUMA machines reported on a
distribution bugzilla.  In these cases, zone_reclaim was enabled by
default due to large NUMA distances.  In general, the complaints have not
been about the workload itself unless it was a file server (in which case
the recommendation was disable zone_reclaim).

The stalls are mostly due to significant amounts of time spent scanning
the preferred zone for pages to free.  After a failure, it might fallback
to another node (as zonelists are often node-ordered rather than
zone-ordered) but stall quickly again when the next allocation attempt
occurs.  In bad cases, each page allocated results in a full scan of the
preferred zone.

Patch 1 checks the preferred zone for recent allocation failure
        which is particularly important if zone_reclaim has failed
        recently.  This avoids rescanning the zone in the near future and
        instead falling back to another node.  This may hurt node locality
        in some cases but a failure to zone_reclaim is more expensive than
        a remote access.

Patch 2 clears the zlc information after direct reclaim.
        Otherwise, zone_reclaim can mark zones full, direct reclaim can
        reclaim enough pages but the zone is still not considered for
        allocation.

This was tested on a 24-thread 2-node x86_64 machine.  The tests were
focused on large amounts of IO.  All tests were bound to the CPUs on
node-0 to avoid disturbances due to processes being scheduled on different
nodes.  The kernels tested are

3.0-rc6-vanilla		Vanilla 3.0-rc6
zlcfirst		Patch 1 applied
zlcreconsider		Patches 1+2 applied

FS-Mark
./fs_mark  -d  /tmp/fsmark-10813  -D  100  -N  5000  -n  208  -L  35  -t  24  -S0  -s  524288
                fsmark-3.0-rc6       3.0-rc6       		3.0-rc6
                   vanilla			 zlcfirs 	zlcreconsider
Files/s  min          54.90 ( 0.00%)       49.80 (-10.24%)       49.10 (-11.81%)
Files/s  mean        100.11 ( 0.00%)      135.17 (25.94%)      146.93 (31.87%)
Files/s  stddev       57.51 ( 0.00%)      138.97 (58.62%)      158.69 (63.76%)
Files/s  max         361.10 ( 0.00%)      834.40 (56.72%)      802.40 (55.00%)
Overhead min       76704.00 ( 0.00%)    76501.00 ( 0.27%)    77784.00 (-1.39%)
Overhead mean    1485356.51 ( 0.00%)  1035797.83 (43.40%)  1594680.26 (-6.86%)
Overhead stddev  1848122.53 ( 0.00%)   881489.88 (109.66%)  1772354.90 ( 4.27%)
Overhead max     7989060.00 ( 0.00%)  3369118.00 (137.13%) 10135324.00 (-21.18%)
MMTests Statistics: duration
User/Sys Time Running Test (seconds)        501.49    493.91    499.93
Total Elapsed Time (seconds)               2451.57   2257.48   2215.92

MMTests Statistics: vmstat
Page Ins                                       46268       63840       66008
Page Outs                                   90821596    90671128    88043732
Swap Ins                                           0           0           0
Swap Outs                                          0           0           0
Direct pages scanned                        13091697     8966863     8971790
Kswapd pages scanned                               0     1830011     1831116
Kswapd pages reclaimed                             0     1829068     1829930
Direct pages reclaimed                      13037777     8956828     8648314
Kswapd efficiency                               100%         99%         99%
Kswapd velocity                                0.000     810.643     826.346
Direct efficiency                                99%         99%         96%
Direct velocity                             5340.128    3972.068    4048.788
Percentage direct scans                         100%         83%         83%
Page writes by reclaim                             0           3           0
Slabs scanned                                 796672      720640      720256
Direct inode steals                          7422667     7160012     7088638
Kswapd inode steals                                0     1736840     2021238

Test completes far faster with a large increase in the number of files
created per second.  Standard deviation is high as a small number of
iterations were much higher than the mean.  The number of pages scanned by
zone_reclaim is reduced and kswapd is used for more work.

LARGE DD
               		3.0-rc6       3.0-rc6       3.0-rc6
                   	vanilla     zlcfirst     zlcreconsider
download tar           59 ( 0.00%)   59 ( 0.00%)   55 ( 7.27%)
dd source files       527 ( 0.00%)  296 (78.04%)  320 (64.69%)
delete source          36 ( 0.00%)   19 (89.47%)   20 (80.00%)
MMTests Statistics: duration
User/Sys Time Running Test (seconds)        125.03    118.98    122.01
Total Elapsed Time (seconds)                624.56    375.02    398.06

MMTests Statistics: vmstat
Page Ins                                     3594216      439368      407032
Page Outs                                   23380832    23380488    23377444
Swap Ins                                           0           0           0
Swap Outs                                          0         436         287
Direct pages scanned                        17482342    69315973    82864918
Kswapd pages scanned                               0      519123      575425
Kswapd pages reclaimed                             0      466501      522487
Direct pages reclaimed                       5858054     2732949     2712547
Kswapd efficiency                               100%         89%         90%
Kswapd velocity                                0.000    1384.254    1445.574
Direct efficiency                                33%          3%          3%
Direct velocity                            27991.453  184832.737  208171.929
Percentage direct scans                         100%         99%         99%
Page writes by reclaim                             0        5082       13917
Slabs scanned                                  17280       29952       35328
Direct inode steals                           115257     1431122      332201
Kswapd inode steals                                0           0      979532

This test downloads a large tarfile and copies it with dd a number of
times - similar to the most recent bug report I've dealt with.  Time to
completion is reduced.  The number of pages scanned directly is still
disturbingly high with a low efficiency but this is likely due to the
number of dirty pages encountered.  The figures could probably be improved
with more work around how kswapd is used and how dirty pages are handled
but that is separate work and this result is significant on its own.

Streaming Mapped Writer
MMTests Statistics: duration
User/Sys Time Running Test (seconds)        124.47    111.67    112.64
Total Elapsed Time (seconds)               2138.14   1816.30   1867.56

MMTests Statistics: vmstat
Page Ins                                       90760       89124       89516
Page Outs                                  121028340   120199524   120736696
Swap Ins                                           0          86          55
Swap Outs                                          0           0           0
Direct pages scanned                       114989363    96461439    96330619
Kswapd pages scanned                        56430948    56965763    57075875
Kswapd pages reclaimed                      27743219    27752044    27766606
Direct pages reclaimed                         49777       46884       36655
Kswapd efficiency                                49%         48%         48%
Kswapd velocity                            26392.541   31363.631   30561.736
Direct efficiency                                 0%          0%          0%
Direct velocity                            53780.091   53108.759   51581.004
Percentage direct scans                          67%         62%         62%
Page writes by reclaim                           385         122        1513
Slabs scanned                                  43008       39040       42112
Direct inode steals                                0          10           8
Kswapd inode steals                              733         534         477

This test just creates a large file mapping and writes to it linearly.
Time to completion is again reduced.

The gains are mostly down to two things.  In many cases, there is less
scanning as zone_reclaim simply gives up faster due to recent failures.
The second reason is that memory is used more efficiently.  Instead of
scanning the preferred zone every time, the allocator falls back to
another zone and uses it instead improving overall memory utilisation.

This patch: initialise ZLC for first zone eligible for zone_reclaim.

The zonelist cache (ZLC) is used among other things to record if
zone_reclaim() failed for a particular zone recently.  The intention is to
avoid a high cost scanning extremely long zonelists or scanning within the
zone uselessly.

Currently the zonelist cache is setup only after the first zone has been
considered and zone_reclaim() has been called.  The objective was to avoid
a costly setup but zone_reclaim is itself quite expensive.  If it is
failing regularly such as the first eligible zone having mostly mapped
pages, the cost in scanning and allocation stalls is far higher than the
ZLC initialisation step.

This patch initialises ZLC before the first eligible zone calls
zone_reclaim().  Once initialised, it is checked whether the zone failed
zone_reclaim recently.  If it has, the zone is skipped.  As the first zone
is now being checked, additional care has to be taken about zones marked
full.  A zone can be marked "full" because it should not have enough
unmapped pages for zone_reclaim but this is excessive as direct reclaim or
kswapd may succeed where zone_reclaim fails.  Only mark zones "full" after
zone_reclaim fails if it failed to reclaim enough pages after scanning.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Stefan Priebe <s.priebe@profihost.ag>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-10-03 11:40:04 -07:00
Clemens Ladisch
2366d7cc46 mm: fix wrong vmap address calculations with odd NR_CPUS values
commit f982f91516 upstream.

Commit db64fe0225 ("mm: rewrite vmap layer") introduced code that does
address calculations under the assumption that VMAP_BLOCK_SIZE is a
power of two.  However, this might not be true if CONFIG_NR_CPUS is not
set to a power of two.

Wrong vmap_block index/offset values could lead to memory corruption.
However, this has never been observed in practice (or never been
diagnosed correctly); what caught this was the BUG_ON in vb_alloc() that
checks for inconsistent vmap_block indices.

To fix this, ensure that VMAP_BLOCK_SIZE always is a power of two.

BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=31572
Reported-by: Pavel Kysilka <goldenfish@linuxsoft.cz>
Reported-by: Matias A. Fonzo <selk@dragora.org>
Signed-off-by: Clemens Ladisch <clemens@ladisch.de>
Signed-off-by: Stefan Richter <stefanr@s5r6.in-berlin.de>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Krzysztof Helt <krzysztof.h1@poczta.fm>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-08-17 10:55:53 -07:00
Oleg Nesterov
84416db676 oom: task->mm == NULL doesn't mean the memory was freed
commit c027a474a6 upstream.

exit_mm() sets ->mm == NULL then it does mmput()->exit_mmap() which
frees the memory.

However select_bad_process() checks ->mm != NULL before TIF_MEMDIE,
so it continues to kill other tasks even if we have the oom-killed
task freeing its memory.

Change select_bad_process() to check ->mm after TIF_MEMDIE, but skip
the tasks which have already passed exit_notify() to ensure a zombie
with TIF_MEMDIE set can't block oom-killer. Alternatively we could
probably clear TIF_MEMDIE after exit_mmap().

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-08-04 21:58:42 -07:00
Daisuke Nishimura
a6f0411fbd memcg: fix behavior of mem_cgroup_resize_limit()
commit 108b6a7846 upstream.

Commit 22a668d7c3 ("memcg: fix behavior under memory.limit equals to
memsw.limit") introduced "memsw_is_minimum" flag, which becomes true
when mem_limit == memsw_limit.  The flag is checked at the beginning of
reclaim, and "noswap" is set if the flag is true, because using swap is
meaningless in this case.

This works well in most cases, but when we try to shrink mem_limit,
which is the same as memsw_limit now, we might fail to shrink mem_limit
because swap doesn't used.

This patch fixes this behavior by:
 - check MEM_CGROUP_RECLAIM_SHRINK at the begining of reclaim
 - If it is set, don't set "noswap" flag even if memsw_is_minimum is true.

Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Balbir Singh <bsingharora@gmail.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Ying Han <yinghan@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-08-04 21:58:41 -07:00
Peter Zijlstra
650957da76 mm/backing-dev.c: reset bdi min_ratio in bdi_unregister()
commit ccb6108f5b upstream.

Vito said:

: The system has many usb disks coming and going day to day, with their
: respective bdi's having min_ratio set to 1 when inserted.  It works for
: some time until eventually min_ratio can no longer be set, even when the
: active set of bdi's seen in /sys/class/bdi/*/min_ratio doesn't add up to
: anywhere near 100.
:
: This then leads to an unrelated starvation problem caused by write-heavy
: fuse mounts being used atop the usb disks, a problem the min_ratio setting
: at the underlying devices bdi effectively prevents.

Fix this leakage by resetting the bdi min_ratio when unregistering the
BDI.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Reported-by: Vito Caputo <lkml@pengaru.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-08-04 21:58:39 -07:00
Benjamin Herrenschmidt
b045b9a265 mm/futex: fix futex writes on archs with SW tracking of dirty & young
commit 2efaca927f upstream.

I haven't reproduced it myself but the fail scenario is that on such
machines (notably ARM and some embedded powerpc), if you manage to hit
that futex path on a writable page whose dirty bit has gone from the PTE,
you'll livelock inside the kernel from what I can tell.

It will go in a loop of trying the atomic access, failing, trying gup to
"fix it up", getting succcess from gup, go back to the atomic access,
failing again because dirty wasn't fixed etc...

So I think you essentially hang in the kernel.

The scenario is probably rare'ish because affected architecture are
embedded and tend to not swap much (if at all) so we probably rarely hit
the case where dirty is missing or young is missing, but I think Shan has
a piece of SW that can reliably reproduce it using a shared writable
mapping & fork or something like that.

On archs who use SW tracking of dirty & young, a page without dirty is
effectively mapped read-only and a page without young unaccessible in the
PTE.

Additionally, some architectures might lazily flush the TLB when relaxing
write protection (by doing only a local flush), and expect a fault to
invalidate the stale entry if it's still present on another processor.

The futex code assumes that if the "in_atomic()" access -EFAULT's, it can
"fix it up" by causing get_user_pages() which would then be equivalent to
taking the fault.

However that isn't the case.  get_user_pages() will not call
handle_mm_fault() in the case where the PTE seems to have the right
permissions, regardless of the dirty and young state.  It will eventually
update those bits ...  in the struct page, but not in the PTE.

Additionally, it will not handle the lazy TLB flushing that can be
required by some architectures in the fault case.

Basically, gup is the wrong interface for the job.  The patch provides a
more appropriate one which boils down to just calling handle_mm_fault()
since what we are trying to do is simulate a real page fault.

The futex code currently attempts to write to user memory within a
pagefault disabled section, and if that fails, tries to fix it up using
get_user_pages().

This doesn't work on archs where the dirty and young bits are maintained
by software, since they will gate access permission in the TLB, and will
not be updated by gup().

In addition, there's an expectation on some archs that a spurious write
fault triggers a local TLB flush, and that is missing from the picture as
well.

I decided that adding those "features" to gup() would be too much for this
already too complex function, and instead added a new simpler
fixup_user_fault() which is essentially a wrapper around handle_mm_fault()
which the futex code can call.

[akpm@linux-foundation.org: coding-style fixes]
[akpm@linux-foundation.org: fix some nits Darren saw, fiddle comment layout]
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Reported-by: Shan Hai <haishan.bai@gmail.com>
Tested-by: Shan Hai <haishan.bai@gmail.com>
Cc: David Laight <David.Laight@ACULAB.COM>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Darren Hart <darren.hart@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-08-04 21:58:38 -07:00
Shaohua Li
4746efded8 vmscan: fix a livelock in kswapd
I'm running a workload which triggers a lot of swap in a machine with 4
nodes.  After I kill the workload, I found a kswapd livelock.  Sometimes
kswapd3 or kswapd2 are keeping running and I can't access filesystem,
but most memory is free.

This looks like a regression since commit 08951e5459 ("mm: vmscan:
correct check for kswapd sleeping in sleeping_prematurely").

Node 2 and 3 have only ZONE_NORMAL, but balance_pgdat() will return 0
for classzone_idx.  The reason is end_zone in balance_pgdat() is 0 by
default, if all zones have watermark ok, end_zone will keep 0.

Later sleeping_prematurely() always returns true.  Because this is an
order 3 wakeup, and if classzone_idx is 0, both balanced_pages and
present_pages in pgdat_balanced() are 0.  We add a special case here.
If a zone has no page, we think it's balanced.  This fixes the livelock.

Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-07-19 22:09:31 -07:00
Bob Liu
8f3b1327aa mm/nommu.c: fix remap_pfn_range()
remap_pfn_range() means map physical address pfn<<PAGE_SHIFT to user addr.

For nommu arch it's implemented by vma->vm_start = pfn << PAGE_SHIFT which
is wrong acroding the original meaning of this function.  And some driver
developer using remap_pfn_range() with correct parameter will get
unexpected result because vm_start is changed.  It should be implementd
like addr = pfn << PAGE_SHIFT but which is meanless on nommu arch, this
patch just make it simply return.

Parameter name and setting of vma->vm_flags also be fixed.

Signed-off-by: Bob Liu <lliubbo@gmail.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: David Howells <dhowells@redhat.com>
Acked-by: Greg Ungerer <gerg@uclinux.org>
Cc: Mike Frysinger <vapier@gentoo.org>
Cc: Bob Liu <lliubbo@gmail.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-07-08 21:14:44 -07:00
KAMEZAWA Hiroyuki
453a9bf347 memcg: fix numa scan information update to be triggered by memory event
commit 889976dbcb ("memcg: reclaim memory from nodes in round-robin
order") adds an numa node round-robin for memcg.  But the information is
updated once per 10sec.

This patch changes the update trigger from jiffies to memcg's event count.
 After this patch, numa scan information will be updated when we see 1024
events of pagein/pageout under a memcg.

[akpm@linux-foundation.org: attempt to repair code layout]
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Ying Han <yinghan@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-07-08 21:14:44 -07:00
KAMEZAWA Hiroyuki
4d0c066d29 memcg: fix reclaimable lru check in memcg
Now, in mem_cgroup_hierarchical_reclaim(), mem_cgroup_local_usage() is
used for checking whether the memcg contains reclaimable pages or not.  If
no pages in it, the routine skips it.

But, mem_cgroup_local_usage() contains Unevictable pages and cannot handle
"noswap" condition correctly.  This doesn't work on a swapless system.

This patch adds test_mem_cgroup_reclaimable() and replaces
mem_cgroup_local_usage().  test_mem_cgroup_reclaimable() see LRU counter
and returns correct answer to the caller.  And this new function has
"noswap" argument and can see only FILE LRU if necessary.

[akpm@linux-foundation.org: coding-style fixes]
[akpm@linux-foundation.org: fix kerneldoc layout]
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Ying Han <yinghan@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-07-08 21:14:43 -07:00
Shaohua Li
0b43c3aab0 mm: __tlb_remove_page() check the correct batch
__tlb_remove_page() switches to a new batch page, but still checks space
in the old batch.  This check always fails, and causes a forced tlb flush.

Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-07-08 21:14:43 -07:00
Mel Gorman
215ddd6664 mm: vmscan: only read new_classzone_idx from pgdat when reclaiming successfully
During allocator-intensive workloads, kswapd will be woken frequently
causing free memory to oscillate between the high and min watermark.  This
is expected behaviour.  Unfortunately, if the highest zone is small, a
problem occurs.

When balance_pgdat() returns, it may be at a lower classzone_idx than it
started because the highest zone was unreclaimable.  Before checking if it
should go to sleep though, it checks pgdat->classzone_idx which when there
is no other activity will be MAX_NR_ZONES-1.  It interprets this as it has
been woken up while reclaiming, skips scheduling and reclaims again.  As
there is no useful reclaim work to do, it enters into a loop of shrinking
slab consuming loads of CPU until the highest zone becomes reclaimable for
a long period of time.

There are two problems here.  1) If the returned classzone or order is
lower, it'll continue reclaiming without scheduling.  2) if the highest
zone was marked unreclaimable but balance_pgdat() returns immediately at
DEF_PRIORITY, the new lower classzone is not communicated back to kswapd()
for sleeping.

This patch does two things that are related.  If the end_zone is
unreclaimable, this information is communicated back.  Second, if the
classzone or order was reduced due to failing to reclaim, new information
is not read from pgdat and instead an attempt is made to go to sleep.  Due
to this, it is also necessary that pgdat->classzone_idx be initialised
each time to pgdat->nr_zones - 1 to avoid re-reads being interpreted as
wakeups.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reported-by: Pádraig Brady <P@draigBrady.com>
Tested-by: Pádraig Brady <P@draigBrady.com>
Tested-by: Andrew Lutomirski <luto@mit.edu>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-07-08 21:14:43 -07:00
Mel Gorman
da175d06b4 mm: vmscan: evaluate the watermarks against the correct classzone
When deciding if kswapd is sleeping prematurely, the classzone is taken
into account but this is different to what balance_pgdat() and the
allocator are doing.  Specifically, the DMA zone will be checked based on
the classzone used when waking kswapd which could be for a GFP_KERNEL or
GFP_HIGHMEM request.  The lowmem reserve limit kicks in, the watermark is
not met and kswapd thinks it's sleeping prematurely keeping kswapd awake in
error.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reported-by: Pádraig Brady <P@draigBrady.com>
Tested-by: Pádraig Brady <P@draigBrady.com>
Tested-by: Andrew Lutomirski <luto@mit.edu>
Acked-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-07-08 21:14:43 -07:00
Mel Gorman
d7868dae89 mm: vmscan: do not apply pressure to slab if we are not applying pressure to zone
During allocator-intensive workloads, kswapd will be woken frequently
causing free memory to oscillate between the high and min watermark.  This
is expected behaviour.

When kswapd applies pressure to zones during node balancing, it checks if
the zone is above a high+balance_gap threshold.  If it is, it does not
apply pressure but it unconditionally shrinks slab on a global basis which
is excessive.  In the event kswapd is being kept awake due to a high small
unreclaimable zone, it skips zone shrinking but still calls shrink_slab().

Once pressure has been applied, the check for zone being unreclaimable is
being made before the check is made if all_unreclaimable should be set.
This miss of unreclaimable can cause has_under_min_watermark_zone to be
set due to an unreclaimable zone preventing kswapd backing off on
congestion_wait().

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reported-by: Pádraig Brady <P@draigBrady.com>
Tested-by: Pádraig Brady <P@draigBrady.com>
Tested-by: Andrew Lutomirski <luto@mit.edu>
Acked-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-07-08 21:14:43 -07:00
Mel Gorman
08951e5459 mm: vmscan: correct check for kswapd sleeping in sleeping_prematurely
During allocator-intensive workloads, kswapd will be woken frequently
causing free memory to oscillate between the high and min watermark.  This
is expected behaviour.  Unfortunately, if the highest zone is small, a
problem occurs.

This seems to happen most with recent sandybridge laptops but it's
probably a co-incidence as some of these laptops just happen to have a
small Normal zone.  The reproduction case is almost always during copying
large files that kswapd pegs at 100% CPU until the file is deleted or
cache is dropped.

The problem is mostly down to sleeping_prematurely() keeping kswapd awake
when the highest zone is small and unreclaimable and compounded by the
fact we shrink slabs even when not shrinking zones causing a lot of time
to be spent in shrinkers and a lot of memory to be reclaimed.

Patch 1 corrects sleeping_prematurely to check the zones matching
	the classzone_idx instead of all zones.

Patch 2 avoids shrinking slab when we are not shrinking a zone.

Patch 3 notes that sleeping_prematurely is checking lower zones against
	a high classzone which is not what allocators or balance_pgdat()
	is doing leading to an artifical belief that kswapd should be
	still awake.

Patch 4 notes that when balance_pgdat() gives up on a high zone that the
	decision is not communicated to sleeping_prematurely()

This problem affects 2.6.38.8 for certain and is expected to affect 2.6.39
and 3.0-rc4 as well.  If accepted, they need to go to -stable to be picked
up by distros and this series is against 3.0-rc4.  I've cc'd people that
reported similar problems recently to see if they still suffer from the
problem and if this fixes it.

This patch: correct the check for kswapd sleeping in sleeping_prematurely()

During allocator-intensive workloads, kswapd will be woken frequently
causing free memory to oscillate between the high and min watermark.  This
is expected behaviour.

A problem occurs if the highest zone is small.  balance_pgdat() only
considers unreclaimable zones when priority is DEF_PRIORITY but
sleeping_prematurely considers all zones.  It's possible for this sequence
to occur

  1. kswapd wakes up and enters balance_pgdat()
  2. At DEF_PRIORITY, marks highest zone unreclaimable
  3. At DEF_PRIORITY-1, ignores highest zone setting end_zone
  4. At DEF_PRIORITY-1, calls shrink_slab freeing memory from
        highest zone, clearing all_unreclaimable. Highest zone
        is still unbalanced
  5. kswapd returns and calls sleeping_prematurely
  6. sleeping_prematurely looks at *all* zones, not just the ones
     being considered by balance_pgdat. The highest small zone
     has all_unreclaimable cleared but the zone is not
     balanced. all_zones_ok is false so kswapd stays awake

This patch corrects the behaviour of sleeping_prematurely to check the
zones balance_pgdat() checked.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reported-by: Pádraig Brady <P@draigBrady.com>
Tested-by: Pádraig Brady <P@draigBrady.com>
Tested-by: Andrew Lutomirski <luto@mit.edu>
Acked-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-07-08 21:14:42 -07:00
KAMEZAWA Hiroyuki
ac34a1a3c3 memcg: fix direct softlimit reclaim to be called in limit path
Commit d149e3b25d ("memcg: add the soft_limit reclaim in global direct
reclaim") adds a softlimit hook to shrink_zones().  By this, soft limit
is called as

   try_to_free_pages()
       do_try_to_free_pages()
           shrink_zones()
               mem_cgroup_soft_limit_reclaim()

Then, direct reclaim is memcg softlimit hint aware, now.

But, the memory cgroup's "limit" path can call softlimit shrinker.

   try_to_free_mem_cgroup_pages()
       do_try_to_free_pages()
           shrink_zones()
               mem_cgroup_soft_limit_reclaim()

This will cause a global reclaim when a memcg hits limit.

This is bug. soft_limit_reclaim() should be called when
scanning_global_lru(sc) == true.

And the commit adds a variable "total_scanned" for counting softlimit
scanned pages....it's not "total".  This patch removes the variable and
update sc->nr_scanned instead of it.  This will affect shrink_slab()'s
scan condition but, global LRU is scanned by softlimit and I think this
change makes sense.

TODO: avoid too much scanning of a zone when softlimit did enough work.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Ying Han <yinghan@google.com>
Cc: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-06-27 18:00:13 -07:00
Jan Kara
08142579b6 mm: fix assertion mapping->nrpages == 0 in end_writeback()
Under heavy memory and filesystem load, users observe the assertion
mapping->nrpages == 0 in end_writeback() trigger.  This can be caused by
page reclaim reclaiming the last page from a mapping in the following
race:

	CPU0				CPU1
  ...
  shrink_page_list()
    __remove_mapping()
      __delete_from_page_cache()
        radix_tree_delete()
					evict_inode()
					  truncate_inode_pages()
					    truncate_inode_pages_range()
					      pagevec_lookup() - finds nothing
					  end_writeback()
					    mapping->nrpages != 0 -> BUG
        page->mapping = NULL
        mapping->nrpages--

Fix the problem by doing a reliable check of mapping->nrpages under
mapping->tree_lock in end_writeback().

Analyzed by Jay <jinshan.xiong@whamcloud.com>, lost in LKML, and dug out
by Miklos Szeredi <mszeredi@suse.de>.

Cc: Jay <jinshan.xiong@whamcloud.com>
Cc: Miklos Szeredi <mszeredi@suse.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-06-27 18:00:13 -07:00
Peter Zijlstra
9b679320a5 mm/memory-failure.c: fix spinlock vs mutex order
We cannot take a mutex while holding a spinlock, so flip the order and
fix the locking documentation.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-06-27 18:00:13 -07:00
Hugh Dickins
d9d90e5eb7 tmpfs: add shmem_read_mapping_page_gfp
Although it is used (by i915) on nothing but tmpfs, read_cache_page_gfp()
is unsuited to tmpfs, because it inserts a page into pagecache before
calling the filesystem's ->readpage: tmpfs may have pages in swapcache
which only it knows how to locate and switch to filecache.

At present tmpfs provides a ->readpage method, and copes with this by
copying pages; but soon we can simplify it by removing its ->readpage.
Provide shmem_read_mapping_page_gfp() now, ready for that transition,

Export shmem_read_mapping_page_gfp() and add it to list in shmem_fs.h,
with shmem_read_mapping_page() inline for the common mapping_gfp case.

(shmem_read_mapping_page_gfp or shmem_read_cache_page_gfp? Generally the
read_mapping_page functions use the mapping's ->readpage, and the
read_cache_page functions use the supplied filler, so I think
read_cache_page_gfp was slightly misnamed.)

Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-06-27 18:00:12 -07:00
Hugh Dickins
94c1e62df4 tmpfs: take control of its truncate_range
2.6.35's new truncate convention gave tmpfs the opportunity to control
its file truncation, no longer enforced from outside by vmtruncate().
We shall want to build upon that, to handle pagecache and swap together.

Slightly redefine the ->truncate_range interface: let it now be called
between the unmap_mapping_range()s, with the filesystem responsible for
doing the truncate_inode_pages_range() from it - just as the filesystem
is nowadays responsible for doing that from its ->setattr.

Let's rename shmem_notify_change() to shmem_setattr().  Instead of
calling the generic truncate_setsize(), bring that code in so we can
call shmem_truncate_range() - which will later be updated to perform its
own variant of truncate_inode_pages_range().

Remove the punch_hole unmap_mapping_range() from shmem_truncate_range():
now that the COW's unmap_mapping_range() comes after ->truncate_range,
there is no need to call it a third time.

Export shmem_truncate_range() and add it to the list in shmem_fs.h, so
that i915_gem_object_truncate() can call it explicitly in future; get
this patch in first, then update drm/i915 once this is available (until
then, i915 will just be doing the truncate_inode_pages() twice).

Though introduced five years ago, no other filesystem is implementing
->truncate_range, and its only other user is madvise(,,MADV_REMOVE): we
expect to convert it to fallocate(,FALLOC_FL_PUNCH_HOLE,,) shortly,
whereupon ->truncate_range can be removed from inode_operations -
shmem_truncate_range() will help i915 across that transition too.

Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-06-27 18:00:12 -07:00